Abstract

The object of our research is the codification of programming knowledge for the synthesis of
concurrent programs. This final report presents the derivation of two concurrent algorithms: dynamic
programming (for the class of problems that run in polynomial time on sequential machines) and
array multiplication. Both derived concurrent versions run in linear time. The concurrent versions
are significant and complex algorithms, though they are not new and have already been reported in
the literature. The synthesis knowledge for these derivations is embodied in seven synthesis rules,
preliminary versions of which are presented in this report. The rules will probably generalize to
other classes of algorithms, but we have not explored that issue yet.

We have also discovered a pair of techniques called virtualization and aggregation. This pair of
techniques (plus the other seven rules) is shown to be powerful enough to synthesize Kung's systolic
array architecture [Kung-76] from a specification of matrix multiplication.


Introduction 


In this paper we describe methods for synthesizing parallel structures from concise, very high level
specifications of algorithms. We use the very high level language, V, which can express both programs
and program transformation rules. In order to allow for reasoning about concurrency, we have
defined language constructs to express parallel structures that solve a class of problems. In these
problems the number of processors is a nonconstant polynomial in some measure of the problem
size. We then developed rules, or abstract input/output specifications, that transform specifications
of sequential algorithms written in V into parallel structures that accomplish the same tasks. We
have coded some of them in V.

First, rules operate on specifications by identifying processing that can be performed concurrently on
distinct elements of arrays that describe either the problem, its solution, or some intermediate results.
They then add specifications of multiple processors, each with responsibility for a portion of the input
data, and a specification of the interactions among the processors. Next, other rules reduce the degree
of interconnection between the processors whenever that degree is not asymptotically constant but is
polynomial in the size of the problem. We apply these rules to a subclass of dynamic programming
specifications and to a specification of matrix multiplication, and have derived asymptotically fast,
sparsely interconnected networks.

We have also developed techniques to create Kung's systolic array parallel structure from a
specification of matrix multiplication. We have identified and formalized a powerful pair of
techniques, which we call virtualization and aggregation, for producing certain parallel structures that
are often complex (and generally recognized as "clever"), given only high level specifications.

Intuitively, virtualization is the addition of one or more dimensions to an array, turning each single
element into a column (or plane or hyperplane) that contains the partial results of the computation
of that element. For example, if $A_{i,j}$ is computed using a single enumeration, then virtualization
would produce a three-dimensional array, say $A'$, and $A'_{i,j,k}$ would contain the $k$th partial result
of this enumeration. The virtualizations we have studied reduce the computation per array element to
$\Theta(1)$.
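
As a minimal sketch of virtualization (in Python rather than V, and not taken from the report), consider
matrix multiplication, where each element of the product is computed by a single enumeration over k.
The hypothetical function `virtualize_matmul` adds a third dimension so that `Cv[i][j][k]` holds the
k-th partial result; each virtual element is then one multiply-add away from its predecessor. All names
here are assumptions made for illustration.

```python
def virtualize_matmul(A, B):
    """Return Cv where Cv[i][j][k] is the sum of A[i][t] * B[t][j] for t = 0..k.

    Illustrative only: the extra k dimension stores every partial result of
    the enumeration that computes C[i][j], so each element needs O(1) work.
    """
    n, p, q = len(A), len(B), len(B[0])
    Cv = [[[0] * p for _ in range(q)] for _ in range(n)]
    for i in range(n):
        for j in range(q):
            for k in range(p):
                prev = Cv[i][j][k - 1] if k > 0 else 0
                Cv[i][j][k] = prev + A[i][k] * B[k][j]  # constant work per element
    return Cv


def collapse(Cv):
    """Recover the ordinary product: the last partial result of each column."""
    return [[col[-1] for col in row] for row in Cv]


Cv = virtualize_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
product = collapse(Cv)
```

The original two-dimensional result is simply the last entry of each virtual column, which is what
`collapse` extracts.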

Also intuitively, aggregation is the grouping together of processors, each of which does a small
amount of work, into groups of processors, each represented by a single processor. Each representative
processor does all of the work that any processor in its original group did, but this can still be done
quickly because each of the processors in the original group had a small amount of work to do, and no
two processors had to do their work at overlapping times. There are an enormous number of ways to
group processors, but we will use only simple ones.
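
A minimal sketch of aggregation, under assumptions that are ours rather than the report's: virtual
processors are indexed (i, j, k), each does one unit of work, and virtual processor (i, j, k) is active
at time step i + j + k. Grouping by (i, j) is one simple grouping; the check confirms that no two
members of a group are active at the same time, so a single processor can stand in for the group.

```python
from collections import defaultdict


def aggregate(n, group_key, active_time):
    """Group n*n*n virtual processors and report whether any group has a clash.

    group_key(i, j, k)   -> identifier of the real processor for the group
    active_time(i, j, k) -> the single time step at which (i, j, k) works
    Both functions are hypothetical parameters chosen for illustration.
    """
    schedule = defaultdict(list)  # real processor -> time steps it must work
    for i in range(n):
        for j in range(n):
            for k in range(n):
                schedule[group_key(i, j, k)].append(active_time(i, j, k))
    clash_free = all(len(ts) == len(set(ts)) for ts in schedule.values())
    return dict(schedule), clash_free


# Group along k: one real processor per (i, j), active when i + j + k ticks by.
groups, ok = aggregate(3, lambda i, j, k: (i, j), lambda i, j, k: i + j + k)
```

A grouping that merges virtual processors with equal i + j would fail the same check, because two
members of such a group can be scheduled for the same time step.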

Mathematical Notation

$P$                    number of processors used

$\vec{b}$              (where $b$ is any letter) a vector of $b_i$, $l \le i \le u$, for some lower limit $l$
                       and upper limit $u$. Where $l$ and $u$ are particularly important and non-obvious,
                       $(b_i)_{l \le i \le u}$ (or $(b_l)$ where $l = u$) may be used.

$\vec{a} \| \vec{b}$   the concatenation of the vectors $\vec{a}$ and $\vec{b}$

$|\vec{a}|$            the length of the vector $\vec{a}$

$((l \ldots u))$       the ordered sequence (not set) of integers from $l$ to $u$, inclusive

$n$                    $n$ will always be used to denote some measure of the size of a problem to be
                       solved by an algorithm or a parallel structure.

$|S|$                  cardinality of the set $S$

$\Theta(g(n))$         Order $g(n)$ where precisely known. This means that $g(n)$ is (within a constant
                       factor) the best estimate of whatever is being measured as $n$ increases. Formally,
                       $f(n) = \Theta(g(n))$ is defined as

                       $$\exists \text{ constants } c, c', c'' \text{ where } n > c \Rightarrow c'\,g(n) < f(n) < c''\,g(n)$$

                       $g(n)$ is called the asymptotic behavior of $f$ or $f(n)$.

Section 1

Problem  Description,  Solution  Techniques  and  Rules 


by 

Richard  M.  King 
Kestrel  Institute 
October  1982 

We have been studying the derivation of parallel computation structures which achieve an asymptotic
improvement in the computation time, as compared with the best known sequential algorithm. To
achieve this the number of processors in use must grow with the size of the problem. We will be
interested in cases where $P \ge \Theta(n)$, because these offer the greatest opportunities for sharing the
work among a large number of processors.

Algorithms whose asymptotic running time is $\Theta(n^i)$ for $i > 1$ often use an internal aggregated data
structure whose size is $\Theta(n^j)$ for some $1 \le j < i$. We try to create parallelism by assigning a
processor to each element of the aggregated data structure. The structures most important in this
work are sets, arrays of various dimensionality, and stacks. This paper considers arrays. Data
structure selection for an algorithm dependent on stacks or sets can produce arrays, so this choice
is not overly restrictive or unnatural. Another important issue in parallel algorithm synthesis is
the connectivity of the resulting multiprocessor net. This is especially important because we seek
asymptotic growth in the number of processors, so too rich a connectivity may result in a collection
of processors and interconnections that would be impossible to fabricate economically. We thus give
attention to reducing connectivity.

In this paper the term parallel structure, or simply structure, will be used to denote a program
designed for a $\Theta(n)$ or larger collection of processors plus a specification of how they should be
interconnected.


§1.1 Taxonomy of the Synthesis Task

Figure 1 is a taxonomy of the various states that a synthesis process can be in, together with the
possible synthesis steps. We will use such phrases as "a Class D synthesis" throughout this document.
In the taxonomy, structures to the right are more desirable than the ones on the left, because they
require fewer connections between processors. Each labelled arc represents a possible synthesis step.

It might seem that every Class D synthesis (for example) is harder than any Class A or Class B
synthesis, since the result of a Class D synthesis is the same as the result of a Class A followed by
a Class B synthesis. This is not true in general, although it usually holds. Some specifications are
especially suitable for some of the "higher" syntheses. One example is that a specification including
backtracking is often more easily synthesized into a tree-structured parallel structure than any other.

Figure 1. Taxonomy of Syntheses


We  concentrate  on  Class  D  synthesis  in  this  report  because  it  represents  an  advance  on  our  prior 
work  on  Class  A  synthesis  ([GCP-81]). 


§1.2 A Case Study: Polynomial-Time Dynamic Programming

We have examined a class of polynomial-time (P-time) dynamic programming algorithms for which
it is possible to synthesize an optimal parallel scheme. The synthesis uses rules displayed in §1.3, and
inference capabilities which Kestrel proposes to develop in 1983, described in [Brown-82]. Abstractly
programmed algorithms in this class include the Cocke-Younger-Kasami parsing algorithm for a
fixed, possibly ambiguous Chomsky Normal Form grammar, described in [AhoUll-72]; the Optimal
Binary Search Tree algorithm, described in [Knuth-73]; and Optimal Multiple Matrix Multiplication,
described in [AHU-74]. All of the algorithms fit into the following scheme.

Each algorithm generates the "solution" to a problem whose input is a sequence $\vec{S}$ of $n$ items by
using a dynamic programming technique. This technique generates a solution for a sequence of items
by combining solutions for contiguous subsequences. The solution $V(\vec{R})$ for a sequence $\vec{R}$ of length
$m$ is found by:

1. Generating the $m-1$ possible partitions of $\vec{R}$ into contiguous subsequences $\vec{a}$ and $\vec{b}$ such that
$\vec{a} \| \vec{b} = \vec{R}$;

2. Forming for each partition a partial solution for $\vec{a} \| \vec{b}$ by applying a function $F$ to $V(\vec{a})$ and
$V(\vec{b})$;

3. Obtaining $V(\vec{a} \| \vec{b})$ by combining (using a binary operation $\oplus$) all of the partial solutions. This
is expressed formally below:

$$V(\vec{R}) = \bigoplus_{\vec{a},\,\vec{b}\,:\;\vec{a} \| \vec{b} = \vec{R}} F\bigl(V(\vec{a}),\, V(\vec{b})\bigr)$$


In order to obtain the following parallel structure that runs in time $\Theta(n)$, two conditions must hold:

► Both $\oplus(x, y)$ and $F(x, y)$ must take constant time.

► $\oplus$ must be both commutative and associative. This allows $F(V(\vec{a}), V(\vec{b}))$ values to be included in
the running $\oplus$-total in any order they become available.

These conditions are met by a sizable class of algorithms, e.g. the algorithms mentioned above. The
algorithm generates the solution $V(\vec{S})$ for the original problem $\vec{S}$ of length $n$. The process starts
with $V((s_i))$ for each $s_i \in \vec{S}$, then generates solutions for subsequences of length 2, 3, and so on, up
to $n$. We give below two dynamic programming algorithms that fit into this scheme.


The Cocke-Younger-Kasami algorithm parses a sequence of terminal symbols according to a fixed
grammar, $G$, in Chomsky Normal Form. This form specifies that each production rule in the
grammar is either of the form $N \to t$ for nonterminal $N$ and terminal $t$, or $N \to PQ$ for nonterminals
$N$, $P$, and $Q$. In this parsing algorithm, each problem is a sequence of terminal symbols, $\vec{t}$, and
the solution $V(\vec{t})$ is the set of nonterminal symbols that derive $\vec{t}$. Let the initial terminal sequence
be $(t_1, \ldots, t_n)$. Then $V((t_i))$ are those nonterminals $N$ for which there is a production rule in the
grammar of the form $N \to t_i$. Given two sequences of terminals $\vec{A}$ and $\vec{B}$, the nonterminals that
produce $\vec{A} \| \vec{B}$ include those nonterminals $N$ for which there is a rule $N \to PQ$ where $P \in V(\vec{A})$ and
$Q \in V(\vec{B})$. The nonterminals that produce a sequence $\vec{S}$ are obtained by dividing the sequence $\vec{S}$
into two subsequences in all possible ways and taking the union of the results. In our formalism,

$$F\bigl(V(\vec{S}),\, V(\vec{T})\bigr) = \{\, N \mid [N \to PQ] \in G \,\wedge\, P \in V(\vec{S}) \,\wedge\, Q \in V(\vec{T}) \,\}$$

and

$\oplus$ is the union operation, which is indeed associative and commutative.
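
The two operations for this instance can be sketched in a few lines of Python. This is an illustrative
encoding, not the report's V code: the grammar's binary rules are represented as (N, P, Q) triples, and
the toy grammar and the names `make_cyk_ops`, `F` and `combine` are assumptions. Because the grammar
is fixed, one evaluation of `F` takes time independent of the input length.

```python
def make_cyk_ops(binary_rules):
    """Build F and the combining operation for CYK from rules N -> P Q.

    binary_rules is a set of (N, P, Q) triples; the encoding is an assumption.
    """
    def F(left, right):
        # Nonterminals N with a rule N -> P Q, where P derives the left part
        # and Q derives the right part.
        return frozenset(n for (n, p, q) in binary_rules
                         if p in left and q in right)

    def combine(x, y):
        return x | y  # set union: associative and commutative

    return F, combine


# Toy grammar (hypothetical): S -> A B, A -> A A, with A -> a and B -> b.
F, combine = make_cyk_ops({("S", "A", "B"), ("A", "A", "A")})
V_a, V_b = frozenset({"A"}), frozenset({"B"})
V_ab = F(V_a, V_b)
```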

Another example of a dynamic programming algorithm fitting our scheme is finding the complexity
of the optimal grouping to multiply a given sequence $(M_1, M_2, \ldots, M_n)$ of matrices. Since matrix
multiplication is associative, multiplying the matrices in different groupings produces the same result
matrix, but different groupings may have different execution efficiencies. If $M$ is a $p \times q$ matrix,
and $N$ is a $q \times r$ matrix, then the product $M \times N$ will be a $p \times r$ matrix, and the multiplication
will execute in time proportional to $pqr$ (if a simple matrix multiplication algorithm is used).

This problem fits into the scheme presented above in the following fashion. The "solution" for each
matrix subsequence $V((M_i, \ldots, M_j))$ is a triple $(p, q, e)$; $p$ is the row size of $M_i$; $q$ the column size of
$M_j$ (since multiplication using any grouping of $(M_i, \ldots, M_j)$ results in a $p \times q$ matrix) and $e$ is the
optimal execution cost for computing $M_i \times \cdots \times M_j$. The $F$ for this algorithm is defined below:

$$F\bigl((p_1, q_1, e_1),\, (p_2, q_2, e_2)\bigr) = (p_1,\; q_2,\; e_1 + e_2 + p_1 q_1 q_2)$$

$\oplus$ for this algorithm returns the triple with the minimum cost element. (Since only the costs can
differ among triples, $\oplus$'s choice is arbitrary if the costs happen to be the same.) The minimum
operation is associative and commutative.
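
A small Python sketch of these two operations follows. The function names and the example dimensions
are made up for illustration; a single matrix is represented by a triple with cost 0.

```python
def F(left, right):
    """Cost triple for multiplying the product `left` by the product `right`."""
    p1, q1, e1 = left
    p2, q2, e2 = right
    assert q1 == p2, "inner dimensions must agree"
    return (p1, q2, e1 + e2 + p1 * q1 * q2)


def combine(x, y):
    """Keep the triple with the smaller cost; ties are broken arbitrarily."""
    return x if x[2] <= y[2] else y


# Three matrices of sizes 10x30, 30x5 and 5x60 (made-up dimensions).
M1, M2, M3 = (10, 30, 0), (30, 5, 0), (5, 60, 0)
left_first = F(F(M1, M2), M3)    # (M1 x M2) x M3
right_first = F(M1, F(M2, M3))   # M1 x (M2 x M3)
best = combine(left_first, right_first)
```

Here the two groupings give the same result shape but very different costs, and `combine` keeps the
cheaper one.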

A high-level specification of the dynamic programming algorithm is presented below. A subsequence
can be represented by its length and where it begins. The array $A$ used below contains solutions to
subsequences: the element $A_{l,m}$ contains $V((v_l, \ldots, v_{l+m-1}))$, where $\vec{v}$ is the initial sequence. The
complexity of each "executable" statement is presented at the right.

The algorithm specification is as follows:

ARRAY A_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
INPUT ARRAY v_l, 1 ≤ l ≤ n

ENUMERATE l ∈ ((1...n)) do                                    Θ(1)
    A_{l,1} ← v_l                                             Θ(n)
ENUMERATE m ∈ ((2...n)) do                                    Θ(1)
    ENUMERATE l ∈ {1...n-m+1} do                              Θ(n)
        A_{l,m} ← ⊕_{k ∈ {1...m-1}} F(A_{l,k}, A_{l+k,m-k})   Θ(n³)


Figure 2. Specification of Θ(n³) Dynamic Programming


A cost of $\Theta(n^3)$ is assigned to the evaluations of $F$ and $\oplus$ because it is given that a single evaluation
of both $F$ and $\oplus$ takes constant time.
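
The specification in Figure 2 can be read as the following sequential Python sketch. It is an
illustration under our own naming (`dynamic_program`, `F`, `combine`), not the report's V text; indices
are kept 1-based to match the figure, and the triple nesting of loops accounts for the cubic cost.

```python
from functools import reduce


def dynamic_program(v, F, combine):
    """Sequential evaluation of the Figure 2 scheme.

    A[(l, m)] holds the solution for the subsequence of length m that starts
    at position l (1-based, as in the specification). Returns A[(1, n)].
    """
    n = len(v)
    A = {}
    for l in range(1, n + 1):
        A[(l, 1)] = v[l - 1]
    for m in range(2, n + 1):
        for l in range(1, n - m + 2):
            partials = (F(A[(l, k)], A[(l + k, m - k)]) for k in range(1, m))
            A[(l, m)] = reduce(combine, partials)
    return A[(1, n)]


# Usage with the matrix-grouping instance: triples are (rows, cols, cost).
chain = [(10, 30, 0), (30, 5, 0), (5, 60, 0)]
best = dynamic_program(
    chain,
    F=lambda a, b: (a[0], b[1], a[2] + b[2] + a[0] * a[1] * b[1]),
    combine=lambda x, y: x if x[2] <= y[2] else y,
)
```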

The time complexity of the specified algorithm executed on a sequential machine is indeed $\Theta(n^3)$.*
However, it is possible to implement the specification on a two-dimensional array of $\Theta(n^2)$ processors,
and the resulting algorithm will run in $\Theta(n)$ time. The memory size of each processor is $\Theta(n)$. Below
we describe the operation of the structure, and then prove that it is a $\Theta(n)$ algorithm. This algorithm
has been reported in the literature [GKT-79].

*A trick is available for Optimal Binary Search Trees. This trick involves bounding $k$ in Figure 2 more
narrowly than $\{1 \ldots m-1\}$. This trick reduces the algorithm's running time to $\Theta(n^2)$, but it does not
generalize to the other algorithms. We know of no analog to this trick for parallel structures.

The network of processors is displayed in Figure 3. Observe that $P_{l,m}$ is connected to $P_{l,m-1}$ and
$P_{l+1,m-1}$. Each processor $P_{l,m}$ will compute the value of $A_{l,m}$. To do this it needs two streams of
information, $A_{l,k}$ and $A_{l+k,m-k}$, where $k < m$. These streams of data come respectively over wires
from processors $P_{l,m-1}$ and $P_{l+1,m-1}$. Each processor $P_{l,m}$ (except $P_{1,n}$) will send every A-value
received from $P_{l,m-1}$ to $P_{l,m+1}$ and from $P_{l+1,m-1}$ to $P_{l-1,m+1}$ as soon as $P_{l,m}$ gets it. Each
processor will also compute F-values and merge them into a running $\oplus$-total as soon as it gets the
A-values necessary.

Figure 3. Processor Interconnections

At first glance, it might appear that this algorithm has time complexity $\Theta(n^2)$. Each processor needs
to receive $\Theta(n)$ A-values from each of its incoming wires; it must at some time perform $\Theta(n)$ worth of
computation on the data received before it sends its result on each of its outgoing wires. However,
a careful timing argument shows that an execution time of $\Theta(n)$ can be achieved.

Definition 1.1. Within $P_{l,m}$, for any $k$ where $1 \le k < m$, $A_{l,k}$ and $A_{l+k,m-k}$ are called a
complementary pair of A-values.

Processor $P_{l,m}$ will apply $F$ to each complementary pair of A-values.

The next lemma shows that each processor $P_{l,m}$ receives all $2m-2$ values it needs, though it waits
$\Theta(m)$ for its first complementary pair, $A_{l,\lceil m/2 \rceil}$ and $A_{l+\lceil m/2 \rceil,\,m-\lceil m/2 \rceil}$.

Lemma 1.2. Each processor $P_{l,m}$ where $m > 1$ receives the values $A_{l,m'}$ where $1 \le m' < m$
and (separately) $A_{l+m-m',m'}$ where $m' < m$, in order of increasing $m'$.

Proof: By induction on $m$. Clearly this is true for $P_{l,2}$, which receives only one value on each
of its incoming wires. Now suppose it is true for $P_{l,m-1}$ and $P_{l+1,m-1}$. Then $P_{l,m}$ will receive
A-values in the proper order from $P_{l,m-1}$ and $P_{l+1,m-1}$ through $m' = m-2$, following which it
receives $A_{l,m-1}$ and $A_{l+1,m-1}$ from those processors. But the latter two A-elements are just those
required to preserve the sequences. ∎

At system startup $T = 0$, and after $x$ units of time $T = x$. The time unit satisfies the first condition of
the following lemma.

Lemma 1.3. If all of the following conditions are met:

► All of the following takes processor $P_{l,m}$ no more than one unit of time: receiving two values, one
each from $P_{l,m-1}$ and $P_{l+1,m-1}$; sending these values on to $P_{l,m+1}$ and $P_{l-1,m+1}$; applying the
function $F$ twice to two complementary pairs of A-values if all values are available; and merging
the resulting values into a running $\oplus$-total.

► The A-values come into $P_{l,m}$ in the order indicated by Lemma 1.2.

► Each processor $P_{l,m}$ sends values received from $P_{l,m-1}$ resp. $P_{l+1,m-1}$ to $P_{l,m+1}$ resp. $P_{l-1,m+1}$
no later than one time unit after receipt.

► At $T = 0$ processor $P_{l,1}$ transmits $A_{l,1}$.

then $P_{l,m}$ will compute $A_{l,m}$ no later than $T = 2m$.

Proof: By induction. $P_{l,1}$ is initialized to know $A_{l,1}$. Now suppose the lemma is true for $m' < m$
and suppose $T = 2m''$ for some $m'' \le m$. First prove a sublemma, that at this time $P_{l,m}$ will have
included at least $\max(0,\, 2(m'' - \lceil m/2 \rceil))$ F-values in its running $\oplus$-total. This sublemma is proven
by induction on $m'' - \lceil m/2 \rceil$.

When reading the proof of the sublemma, keep in mind that the "life" of a processor $P_{l,m}$ is divided
into three epochs:

1. When $T < m$, the processor may have received no A-values.

2. When $m \le T < \tfrac{3}{2}m$, the processor will have received at least $T - m$ A-values from each of
its input lines. Since the first half of the A-values from each inbound wire form complementary
pairs with the last half of the values from the other inbound wire, $P_{l,m}$ may not have been able
to perform any calculations of any F-values yet.

3. When $T \ge \tfrac{3}{2}m$, the processor will have received at least half (more accurately, at least $T - m$) of
the values from each inbound wire. During each unit interval, it will receive one A-value from
each inbound wire, which will "match" with some value that was stored from the other wire
during epoch 2. Two F-calculations will be possible, one for each of the just-received inbound
data.

If $m'' - \lceil m/2 \rceil \le 0$ the sublemma requires nothing. If $m'' - \lceil m/2 \rceil > 0$, consider the situation
$m'' - \lceil m/2 \rceil$ time units before $T = 2m''$. All processors $P_{l,j}$ and $P_{l+m-j,j}$, where
$j \le \lceil m/2 \rceil + m'' - \lceil m/2 \rceil$, will have completed their work and their answers will have had time to
reach $P_{l,m}$. Thus at least $2(m'' - \lceil m/2 \rceil)$ pairs of A-values will have arrived. Since (by induction on
$m'' - \lceil m/2 \rceil$) two time units ago $2(m'' - \lceil m/2 \rceil) - 2$ F-values had already been merged into the
running $\oplus$-total, there is plenty of time to merge two new F-values into the running $\oplus$-total,
completing the induction step of the sublemma.

Lemma 1.3 follows immediately from the sublemma and the observation that the merging of $m-1$
F-values into the running $\oplus$-total in $P_{l,m}$ constitutes a calculation of $A_{l,m}$. ∎


Theorem 1.4. The time to compute $A_{1,n}$ is $\Theta(n)$.

Proof  :  Immediate  from  Lemma  1.3.  | 
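
The linear bound can be checked numerically with a simplified timing model. The Python sketch below
is not the report's proof: it assumes (our simplification) that each wire hop takes exactly one time
unit, that values are forwarded immediately, that every bottom-row processor transmits at T = 0, and that
a processor can evaluate F on at most two complementary pairs per time unit. The function name
`completion_times` is hypothetical.

```python
def completion_times(n):
    """Completion time of every P[l, m] under a simplified timing model.

    Assumptions (ours, for illustration): every wire hop takes one time unit,
    values are forwarded as soon as they arrive, P[l, 1] sends at T = 0, and a
    processor can evaluate F on at most two complementary pairs per time unit.
    """
    done = {(l, 1): 0 for l in range(1, n + 1)}
    for m in range(2, n + 1):
        for l in range(1, n - m + 2):
            # Pair k is ready once both A[l, k] and A[l+k, m-k] have arrived.
            ready = sorted(
                max(done[(l, k)] + (m - k),      # m-k hops from P[l, k]
                    done[(l + k, m - k)] + k)    # k hops from P[l+k, m-k]
                for k in range(1, m)
            )
            slot_end, used = 0, 2  # end of current unit interval, pairs done in it
            for r in ready:
                if r + 1 > slot_end or used == 2:
                    slot_end, used = max(slot_end + 1, r + 1), 0
                used += 1
            done[(l, m)] = slot_end
    return done


done = completion_times(8)
within_bound = all(t <= 2 * m for (l, m), t in done.items())
```

Under these assumptions every processor finishes within the 2m bound of Lemma 1.3, so the top
processor finishes in time linear in n.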

In the next section we will show how this parallel structure can be derived from the specification in
Figure 2.


§1.3 Rules for Parallel Structure Synthesis

Rules  for  the  Class  A  synthesis  task  appear  elsewhere  (see  [GCP-81]).  This  report  describes  the  task 
of  synthesis  of  parallel  structures  for  arrays  of  processors  in  which  the  interconnections  describe  a 
k-dimensional  lattice  for  some  k,  i.e.  a  Class  D  synthesis  task. 

As examples of rule application and a demonstration of the rules' effectiveness, we apply each rule to
the P-time dynamic programming specification. We repeat that specification here, augmented
with output array descriptions:


ARRAY A_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
INPUT ARRAY v_l, 1 ≤ l ≤ n
OUTPUT ARRAY O

ENUMERATE l ∈ ((1...n)) do                                    Θ(1)
    A_{l,1} ← v_l                                             Θ(n)
ENUMERATE m ∈ ((2...n)) do                                    Θ(1)
    ENUMERATE l ∈ {1...n-m+1} do                              Θ(n)
        A_{l,m} ← ⊕_{k ∈ {1...m-1}} F(A_{l,k}, A_{l+k,m-k})   Θ(n³)
O ← A_{1,n}                                                   Θ(1)

Figure 4. Specification of Θ(n³) Dynamic Programming with Explicit I/O


1.3.1  Preparatory  Rules 

The problems now amenable to Class D synthesis have internal arrays of storage, and the requirement
is to fill in the array by computing a value for each element. Our strategy will be to assign a processor
to each element of the array. This rule declares a processor for each element of the main array in the
problem, and composes a single enumerated PROCESSORS statement. This statement has several
clauses: the processors definition clause, the HAS clause, the HEARS clause(s), and the USES
clause(s). Any part of the PROCESSORS statement except the processors definition clause can be
made conditional. An example of a PROCESSORS statement is shown below:

PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
    HAS A_{l,m}
    If m = 1 then USES v_l, HEARS Q
    If 2 ≤ m ≤ n then
        USES A_{l,k}, 1 ≤ k < m
        USES A_{l+k,m-k}, 1 ≤ k < m
        HEARS P_{l,m-1}
        HEARS P_{l+1,m-1}

This statement means all of the following:

► A family of processors exists. The family name is P. Each member of the family is named by
two indices, and any member $P_{l,m}$ exists if $1 \le m \le n \wedge 1 \le l \le n-m+1$. The value $n$ is
an externally defined constant value (for any instance of the problem) defining the problem size.
This PROCESSORS statement actually declares some facts about every processor in the family.

► Each element, $P_{l,m}$, of this family is responsible for computing the value of (i.e. HAS) $A_{l,m}$. $A$ is
an array declared elsewhere in the specification that contains the PROCESSORS statement.

► If $P_{l,1}$ is defined it needs $v_l$ to compute its HAS values, and it expects to get these values from
(i.e. HEARS) the (only) processor in the Q family.

► If $P_{l,m}$ is defined and $2 \le m \le n$, then $P_{l,m}$ needs the values of $A_{l,k}$ for any $k$, $1 \le k \le m-1$.
It also needs $A_{l+k,m-k}$ for any $k$ in that range. It expects to get these values from processors
in the P family, namely $P_{l,m-1}$ and $P_{l+1,m-1}$. The scope of the bound variables list (in this
case, "l, m") is the entire PROCESSORS statement. Two PROCESSORS statements must have
distinct processor names (in this case, "P").
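
To make the reading of the statement concrete, the following Python sketch expands the example
PROCESSORS statement for a fixed problem size. The dictionary and tuple encoding is an assumption of
ours and is not V syntax; `expand_processors` is a hypothetical helper.

```python
def expand_processors(n):
    """Expand the example PROCESSORS statement for a concrete problem size n.

    Returns {(l, m): {"has": ..., "uses": [...], "hears": [...]}} using tuples
    such as ("A", l, m); this encoding is an assumption, not V syntax.
    """
    family = {}
    for m in range(1, n + 1):
        for l in range(1, n - m + 2):
            entry = {"has": ("A", l, m), "uses": [], "hears": []}
            if m == 1:
                entry["uses"].append(("v", l))
                entry["hears"].append(("Q",))
            else:  # 2 <= m <= n
                for k in range(1, m):
                    entry["uses"].append(("A", l, k))
                for k in range(1, m):
                    entry["uses"].append(("A", l + k, m - k))
                entry["hears"] += [("P", l, m - 1), ("P", l + 1, m - 1)]
            family[(l, m)] = entry
    return family


family = expand_processors(4)
```

Every member with m of at least 2 uses 2(m-1) array values but hears only two neighbours, which is
the sparse interconnection the statement describes.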



1.3.1.1 Rule A1: Give Each Non-I/O Array Element its Own Processor

By our conventions, the portion before the '→' is the antecedent and the rest is the consequent.
Variables free in the antecedent are implicitly existentially quantified and the scope of this
quantification is the entire rule. Variables free only in the consequent are universally quantified (but
this is rare). A rule is said to apply if the antecedent is true; when this happens the semantics of
the rule is to make the consequent true. It is explicitly permissible for the consequent to make the
antecedent no longer true.


rule MAKE-PSs (**) TRANSFORM
      X.STATEMENT
    ∧ X ∈ **.STATEMENTS
    ∧ X: 'ARRAY NAME_BOUND ENUMERS'
    ∧ Y = (GENSYM 'PROC)
    ∧ Z: 'PROCESSORS Y_BOUND ENUMERS HAS NAME_BOUND'
    →
      Z ∈ **.STATEMENTS

MAKE-PSs applied to (P.1) binds as follows:

bindings:

    ** = ((entire specification))
    X = 'ARRAY A_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1'
    NAME = 'A'
    BOUND = 'l,m'
    ENUMERS = '1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1'
    Y = 'P'
    Z = 'PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
         HAS A_{l,m}'

obtaining

(P.2)   ARRAY A_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
      | PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1 HAS A_{l,m}
        INPUT ARRAY v_l, 1 ≤ l ≤ n
        OUTPUT ARRAY O

        ENUMERATE l ∈ ((1...n)) do                                    Θ(1)
            A_{l,1} ← v_l                                             Θ(n)
        ENUMERATE m ∈ ((2...n)) do                                    Θ(1)
            ENUMERATE l ∈ {1...n-m+1} do                              Θ(n)
                A_{l,m} ← ⊕_{k ∈ {1...m-1}} F(A_{l,k}, A_{l+k,m-k})   Θ(n³)
        O ← A_{1,n}                                                   Θ(1)


as the new state of the database.
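
The effect of MAKE-PSs can be imitated by a short Python function over a list of declarations. This
is only a sketch: the dictionary encoding of statements, the function name `make_pss`, and the
`fresh_names` iterator standing in for GENSYM are all our assumptions, and the sketch does not guard
against applying the rule twice.

```python
def make_pss(statements, fresh_names):
    """Add one PROCESSORS statement after each non-I/O ARRAY declaration.

    statements: list of dicts such as {"kind": "ARRAY", "io": "", ...}
    fresh_names: iterator of unused processor names (plays the role of GENSYM).
    """
    out = []
    for s in statements:
        out.append(s)
        if s["kind"] == "ARRAY" and s["io"] == "":
            out.append({
                "kind": "PROCESSORS",
                "name": next(fresh_names),
                "bound": s["bound"],       # same bound variables as the array
                "enumers": s["enumers"],   # same enumerators as the array
                "has": (s["name"], s["bound"]),
            })
    return out


spec = [
    {"kind": "ARRAY", "io": "", "name": "A", "bound": "l,m",
     "enumers": "1<=m<=n, 1<=l<=n-m+1"},
    {"kind": "ARRAY", "io": "INPUT", "name": "v", "bound": "l",
     "enumers": "1<=l<=n"},
    {"kind": "ARRAY", "io": "OUTPUT", "name": "O", "bound": "", "enumers": ""},
]
result = make_pss(spec, iter(["P"]))
```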


1.3.1.2 Rule A2: Assign I/O Arrays to Processors

This rule assigns a single processor to each input or output array. The reason only a single processor
is assigned is that it is assumed that input values will reside in a single entity, such as a tape drive.


rule MAKE-IOPSs (**) TRANSFORM
      X.STATEMENT
    ∧ X ∈ **.STATEMENTS
    ∧ X: 'IO ARRAY NAME_BOUND ENUMERS'
    ∧ (IO = 'INPUT ∨ IO = 'OUTPUT)
    ∧ Y = (GENSYM 'PROC)
    ∧ Z: 'PROCESSORS Y HAS NAME_BOUND ENUMERS'
    →
      Z ∈ **.STATEMENTS

Rules MAKE-PSs and MAKE-IOPSs make PROCESSORS statements that do not have USES
and HEARS clauses yet. The next rule fills in those clauses, and subsequent rules improve them.

Rule MAKE-IOPSs applies for two sets of bindings:


    ** = ((entire specification))
    X = 'OUTPUT ARRAY O'
    IO = 'OUTPUT
    NAME = O
    BOUND = (empty string)
    ENUMERS = (empty string)
    Y = R
    Z = 'PROCESSORS R
         HAS O'


    ** = ((entire specification))
    X = 'INPUT ARRAY v_l, 1 ≤ l ≤ n'
    IO = 'INPUT
    NAME = v
    BOUND = 'l'
    ENUMERS = '1 ≤ l ≤ n'
    Y = Q
    Z = 'PROCESSORS Q
         HAS v_l, 1 ≤ l ≤ n'


resulting in

(P.3)    ARRAY A_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
         PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1 HAS A_{l,m}
         INPUT ARRAY v_l, 1 ≤ l ≤ n
       | PROCESSORS Q HAS v_l, 1 ≤ l ≤ n
         OUTPUT ARRAY O
       | PROCESSORS R HAS O

         ENUMERATE l ∈ ((1...n)) do                                    Θ(1)
(P.3a)       A_{l,1} ← v_l                                             Θ(n)
         ENUMERATE m ∈ ((2...n)) do                                    Θ(1)
             ENUMERATE l ∈ {1...n-m+1} do                              Θ(n)
(P.3b)           A_{l,m} ← ⊕_{k ∈ {1...m-1}} F(A_{l,k}, A_{l+k,m-k})   Θ(n³)
(P.3c)   O ← A_{1,n}                                                   Θ(1)

So  far,  all  rule  application  can  be  done  in  a  straightforward  manner,  without  inference. 
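
Rule A2 is equally mechanical, as the following Python sketch suggests. As before, the dictionary
encoding, the name `make_iopss`, and the `fresh_names` iterator (a stand-in for GENSYM) are assumptions
for illustration only.

```python
def make_iopss(statements, fresh_names):
    """Give each INPUT/OUTPUT array a single processor that HAS the whole array.

    statements: list of dicts like {"kind": "ARRAY", "io": "INPUT", ...}
    fresh_names: iterator of unused processor names (plays the role of GENSYM).
    """
    out = []
    for s in statements:
        out.append(s)
        if s["kind"] == "ARRAY" and s["io"] in ("INPUT", "OUTPUT"):
            out.append({
                "kind": "PROCESSORS",
                "name": next(fresh_names),   # no bound variables: one processor
                "has": (s["name"], s["bound"], s["enumers"]),
            })
    return out


spec = [
    {"kind": "ARRAY", "io": "", "name": "A", "bound": "l,m",
     "enumers": "1<=m<=n, 1<=l<=n-m+1"},
    {"kind": "ARRAY", "io": "INPUT", "name": "v", "bound": "l",
     "enumers": "1<=l<=n"},
    {"kind": "ARRAY", "io": "OUTPUT", "name": "O", "bound": "", "enumers": ""},
]
result = make_iopss(spec, iter(["Q", "R"]))
```

Unlike the sketch for Rule A1, the generated statement carries no bound variables of its own: the
enumerators move into the HAS clause, so a single processor holds the entire array.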


1.3.1.3 Rule A3: Determine Processors' Inputs

We need rules to describe the connections between processors and the data that processors need to
produce results. This rule is very conservative: it determines what array values each processor P'
needs, and it specifies a direct connection from the processors holding those values to P'. The USES
clause describes the values that a processor needs; the HEARS clause describes the processors that
have (HAS) these values.

To determine this, consider the innermost loop which assigns values to array elements indexed by
non-region-constants. Note that the form of the rule shown below evidences a need for elaborate
flow analysis. Non-constant array index expressions are used as processor indices. The indices
for those array elements whose values can affect the assigned value comprise the index expressions
for the USES and HEARS sets. A reference at the same loop level will normally generate USES
and HEARS clauses with null enumerations. A reference contained in a deeper loop will normally
generate instances of such clauses with inherited enumerators from the loops.


rule MAKE-USES-HEARS (**) TRANSFORM
      **: 'PROCESSORS PDCL_PBV PENUMER HAS ANAME_BINDEX'
    ∧ CB = **.CONTAINING-BLOCK
    ∧ X = (INNER-LOOP-THAT-DEFINES ANAME CB)
    ∧ Y ∈ (ARRAY-REFERENCES-AFFECTING X)
    ∧ Z = (EFFECTIVE-ENUMERATOR-OF Y X)
    ∧ W.CONDITIONS = CB.CONDITIONS ∪ (INFERRED-CONDITIONS X)
    ∧ W.CLASS = USES-CLAUSE
    ∧ W.ARG = 'ANAME
                (REL-BV PBV X.DEF-OF.INDEX-EXPR Y Z)
                (RELENUMER PBV X.DEF-OF.INDEX-EXPR Y Z)'
    ∧ Q.CONDITIONS = CB.CONDITIONS ∪ (INFERRED-CONDITIONS X)
    ∧ Q.CLASS = HEARS-CLAUSE
    ∧ HISBV = ANAME.PROCSTMT.PROC-BV-OF
    ∧ Q.ARG = 'ANAME.PROC-OF
                (REL-BV HISBV X.DEF-OF.INDEX-EXPR Y Z)
                (RELENUMER HISBV X.DEF-OF.INDEX-EXPR Y Z)'
    →
      W ∈ **.clauses
    ∧ Q ∈ **.clauses

The  INNER-LOOP-THAT-DEFINES  function  finds  an  innermost  locality  where  an  element  from  the 
argument  array  is  defined  (not  merely  used).  The  ARRAY-REFERENCES-AFFECTING  function 
returns  a  set  of  all  points  in  the  program  where  an  array  is  referenced  and  the  value  returned  can 
affect  the  results  of  its  operand,  a  program  point.  Tbe  EFFECTIVE-ENUMERATOR-OF  function 
determines  what  (possibly  implicit)  enumerators  its  first  argument  (an  array  reference)  is  controlled 
by,  beyond  the  enumerators  that  control  its  second  argument  (an  array  definition  in  this  case). 

The map, z.CONDITIONS, allows any node z to be placed under the influence of conditions (an if
clause). INFERRED-CONDITIONS is a function that produces an if clause that specifies exactly
those  conditions  that  must  be  true  for  the  point  representing  the  argument  to  be  reached  (a  form  of 
assertion  propagation). 

REL-BV and RELENUMER give a piece of text that respectively will serve as a bound variable
and an enumerator for the fragment enumerated by the fourth argument to be valid for the third
argument in the context of the second argument, using the bound variables of the first argument.
This would be the bound variables of the fourth argument unless there is a variable name clash.

This  modifies  the  first  PROCESSORS  statement,  which  becomes 

PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1
    HAS A_{l,m}
|   if m=1 then USES v_l, HEARS Q

Application to the assignment to A_{l,m} in (P.3b) produces

PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1
    HAS A_{l,m}
    if m=1 then USES v_l, HEARS Q
|   if 2 ≤ m ≤ n then
|       USES A_{l,k}, 1 ≤ k < m
|       USES A_{l+k,m−k}, 1 ≤ k < m
|       HEARS P_{l,k}, 1 ≤ k < m
|       HEARS P_{l+k,m−k}, 1 ≤ k < m

Finally, apply MAKE-USES-HEARS one last time, to the null "enumeration", (P.3c), that sends
the output value to the output "array", O. This forces us to modify R's PROCESSORS statement
as follows:


PROCESSORS R HAS O
|   USES A_{1,n} HEARS P_{1,n}


This  statement  is  in  its  final  form. 

The applications of MAKE-USES-HEARS require flow analysis and some ability to reason about
enumeration (to construct if clauses).
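
The USES and HEARS sets that MAKE-USES-HEARS attaches to each processor of the dynamic programming structure can be enumerated directly. The following is a minimal Python sketch, not part of the V system; the function name uses_hears and the tuple encoding of array elements and processors are assumptions made for illustration.

```python
def uses_hears(l, m, n):
    """Return (uses, hears) for processor P[l,m] of the unreduced structure.

    Elements are encoded as tuples: ("v", l) is an input value, ("A", l, m) an
    array element, ("P", l, m) a processor, and "Q" the input processor.
    This encoding is hypothetical; the report only gives the clauses.
    """
    if not (1 <= m <= n and 1 <= l <= n - m + 1):
        raise ValueError("no processor with these indices")
    if m == 1:
        return {("v", l)}, {"Q"}
    uses, hears = set(), set()
    for k in range(1, m):  # 1 <= k < m
        for idx in ((l, k), (l + k, m - k)):
            uses.add(("A",) + idx)
            hears.add(("P",) + idx)
    return uses, hears


if __name__ == "__main__":
    print(uses_hears(1, 3, 4))
```

Before reduction, a processor with m ≥ 2 HEARS 2(m−1) other processors, which is the connectivity that the optimization rules of the next subsection set out to lower.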


1.3.2  Optimisation  Rules 

The  rest  of  the  rules  described  in  this  section  will  transform  the  simplest  parallel  structures  into 
more  efficient  ones.  They  do  this  by  detecting  and  removing  redundant  interconnections. 


1.3.2.1 Rule A4: Improve HEARS clauses

It may be that a HEARS clause of a PROCESSORS statement requires each processor to be connected
to more than one other processor. This is undesirable, because the number of interconnections
in the whole collection of processors would grow faster than the number of processors, and the cost
of interconnections would exceed the cost of processors for sufficiently large problems. This would,
in turn, decrease the size of the largest problem that could be handled by a given parallel structure.

However, often it is not necessary for each processor to be connected to all other processors whose
values it needs. If processor P_a needs values from processors P_b and P_c, but P_b needs a value from
processor P_c, it may not be necessary for P_a to be connected to P_c. P_a must be connected to P_b,
but P_b will be able to get the value that P_a wants from P_c, so it (P_b) can pass that datum along.




This form of this observation only secures a constant factor reduction in the number of interconnections
(in this case, from two to one), but it is possible to do better by extending the principle.
Suppose, for example, that a structure includes a family of processors P_i for 1 ≤ i ≤ n. Further
suppose that ∀i,j where j < i, P_i needs values from P_j. In this case, P_{i+1} will need all the values
P_i needs, plus the value in P_i itself.

Basic Observation 1.5. In a case such as this P_j is capable of supplying all of the information that
P_{j+1} needs, so it is possible to modify the structure to replace the Θ(n) connections required by this
HEARS clause by a single connection.

Definition 1.6. In a parallel structure, a family of processors is the set of processors defined by a
single PROCESSORS statement when enumerated over the PROCESSORS clause's enumerator.
That family is generated by that PROCESSORS statement.

Definition 1.7. The set of processors in a processor P_a's family HEARd by P_a due to a HEARS
clause H_0 will be written H_0(P_a).

Definition 1.8. Consider H_0(P_a) and H_0(P_b). Suppose that each is a subset of the same family as
P_a and P_b (which are in the same family because they both have the same HEARS clause, H_0). The
interconnections defined by H_0 telescope if these sets H_0(P_a) and H_0(P_b) either are disjoint or one
strictly contains the other, for any choice of P_a and P_b in the family. We also say that H_0 telescopes.
If ∀P_a,P_b: [∅ ⊂ H_0(P_a) ⊂ H_0(P_b) ⇒ ∃P_c: [H_0(P_a) ∪ {P_a} = H_0(P_c)]] then H_0 snowballs.
The notion of a USES clause telescoping is defined similarly. A partition is induced by a telescoping
clause C_0 if two processors are in the same partition whenever the sets defined by C_0 overlap.

Theorem 1.9. If a HEARS clause H_0 snowballs, it can be replaced by another HEARS clause that
only specifies input from a single processor.

Proof. Consider the family of processors described by the PROCESSORS statement that contains
the HEARS clause. Consider also the induced partition Π.

If the cardinality of an equivalence class E ∈ Π is (say) e, then ∀P_a ∈ E: |H_0(P_a)| < e. (No processor
can HEAR itself because it would never be able to complete its calculation if it needed its own result
to do so.) Since ∀x,y: x ≠ y ⇒ |H_0(P_x)| ≠ |H_0(P_y)|, and since |{0...e−1}| = e, the processors in E
can be completely ordered by the cardinalities of their HEARd sets. By the basic observation and
the snowballing property, each processor can get the information that H_0 requires from the processor
that is its predecessor in this ordering. ∎

Definition 1.10. We call replacing a HEARS clause, as in the previous theorem, reducing the clause.
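
The telescoping and snowballing conditions, and the reduction described in the proof, can be checked mechanically on a small family. The sketch below is an illustrative Python rendering under stated assumptions: HEARd sets are given explicitly as a dictionary, the family forms a single class of the induced partition, and the function names are hypothetical.

```python
from itertools import combinations


def telescopes(hears):
    """hears maps each processor to the frozenset of processors it HEARS."""
    for a, b in combinations(hears, 2):
        ha, hb = hears[a], hears[b]
        # Each pair must be disjoint, or one set must strictly contain the other.
        if not (ha.isdisjoint(hb) or ha < hb or hb < ha):
            return False
    return True


def snowballs(hears):
    if not telescopes(hears):
        return False
    all_sets = set(hears.values())
    for a in hears:
        for b in hears:
            # Non-empty H(a) strictly inside H(b) requires some c with
            # H(a) | {a} == H(c).
            if hears[a] and hears[a] < hears[b]:
                if (hears[a] | {a}) not in all_sets:
                    return False
    return True


def reduce_hears(hears):
    """Order processors by |H(P)| and let each one hear only its predecessor.

    Assumes the whole family is one class of the induced partition.
    """
    order = sorted(hears, key=lambda p: len(hears[p]))
    reduced = {}
    for i, p in enumerate(order):
        has_pred = i > 0 and hears[p]
        reduced[p] = frozenset({order[i - 1]}) if has_pred else frozenset()
    return reduced


if __name__ == "__main__":
    # The example family: P_i needs values from every P_j with j < i.
    chain = {i: frozenset(range(1, i)) for i in range(1, 6)}
    print(snowballs(chain), reduce_hears(chain))
```

For the chain family each processor ends up hearing only its immediate predecessor, so the number of connections per processor drops from Θ(n) to one.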

We expect to be able to prove the following result; it "falls out" of a generalization of Theorem 1.9
for which we are working out a rigorous proof.

Conjecture 1.11. Reducing a snowballing HEARS clause will produce a parallel structure whose
asymptotic speed is the same as the speed of the original structure.

We can now state this rule in English as follows: "If a HEARS clause snowballs then reduce it", and
more formally as follows:

rule REDUCE-HEARS (**) TRANSFORM

**='PROCESSORS PNAME_{PDV} PENUMER ... if COND1 then
        HEARS PNAME_{HBV}
        HENUMER ...'
∧ PENUMER.CLASS=ENUMERS
∧ HENUMER.CLASS=ENUMERS
∧ COND1.CLASS=PREDICATE
∧ COND2.CLASS=PREDICATE
∧ COND1.FREE-VARS ⊆ PDV
∧ COND2.FREE-VARS ⊆ PDV
∧ SET1=(BOUNDBY PDV (1) HBV HENUMER)
∧ SET1a=(BOUNDBY PDV (2) HBV HENUMER)
∧ PROC1=(BOUNDBY PDV (1) PDV ∅)
∧ SET2=(BOUNDBY PDV NIL HBV HENUMER)
∧ PROC2=(BOUNDBY PDV NIL PDV ∅)
∧ PROCh=(BOUNDBY PDV NIL HEXPR ∅)
∧ (THEOREM
    ((SET1 ∩ SET1a) ∈ {∅ SET1 SET1a}
     ∧ ((∅ ⊂ SET1 ⊂ SET1a ∧ COND1) ⇒ SET1 ∪ PROC1=SET2)
     ∧ (COND1 ∧ COND2 ⇒ SET1 ∪ PROCh=SET2)))

**='PROCESSORS PNAME_{PDV} PENUMER ...
        HEARS PNAME_{HEXPR} ...'

When this rule is applied to the current state, the bindings will be as follows:

**='PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1
        HEARS P_{l+k,m−k}, 1 ≤ k ≤ m−1 ...'
PNAME='P'
PDV='l, m'
PENUMER='1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1'
HBV='l+k, m−k'
HENUMER='1 ≤ k ≤ m−1'
SET1={((l_1+k, m_1−k)): 1 ≤ k ≤ m_1−1}
SET1a={((l_2+k, m_2−k)): 1 ≤ k ≤ m_2−1}
PROC1={((l_1, m_1))}
SET2={((l+k, m−k)): 1 ≤ k ≤ m−1}
PROC2={((l, m))}
PROCh={((l+1, m−1))}
HEXPR='((l+1, m−1))'
COND1='2 ≤ m ≤ n'
COND2=true


THEOREM is a function whose argument is a symbolic set-theoretic expression whose atomic terms
are set expressions. These expressions are principally created by the BOUNDBY function, whose
inputs are the bound variables list of the processor name id, an identity parameter, the form that
defines the array references that comprise the array definition, and the enumerator (if any) for the
array reference.

PDV  will  take  values  of  sequences  of  bound  variables,  and  HBV  will  be  sequences  of  expressions. 

BOUNDBY is a quaternary function which returns an object that acts like a set-valued expression
with  free  variables.  Its  four  arguments  are: 

►  A  sequence  of  variable  names,  called  VNS. 

► An identification index, called IIX.

►  A  sequence  of  expressions,  called  EXS. 

►  A  set  of  enumeration  operators,  called  EOPS. 

I  gave  each  argument  a  name  here  for  easy  reference.  BOUNDBY  composes  the  object  to  return 
by first associating a "subscripted" new free variable (not to be confused with an array element
reference)  with  each  variable  in  VNS.  The  subscript  is  IIX  (and  if  IIX  is  NIL  there  is  no  subscript). 
Every  occurrence  of  an  element  of  VNS  in  EXS  or  EOPS  is  also  subscripted. 

In  the  BOUNDBY  expression  defining  PROCh,  HEXPR  (implicitly  existentially  quantified)  is 
constrained  by  the  THEOREM  and  PROCh=  . . .  expressions. 

Some bounds on the range of possible values for HEXPR are necessary. Something like

∀x ∃y: x ∈ HEXPR ⇒ x ∈ {y, 'y+1', 'y−1'} ∧ y ∈ PDV

would serve.

COND2  is  also  constrained  by  the  theorems  that  can  be  proven. 

This rule reduces the HEARS clauses from the large PROCESSORS statement of the current state
to

    HEARS P_{l,m−1}
    HEARS P_{l+1,m−1}


The resulting PROCESSORS statement is

PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1
    HAS A_{l,m}
    if m=1 then USES v_l, HEARS Q
    if 2 ≤ m ≤ n then
        USES A_{l,k}, 1 ≤ k < m
        USES A_{l+k,m−k}, 1 ≤ k < m
        HEARS P_{l,m−1}
        HEARS P_{l+1,m−1}


Figure 5. Final Form of Main Processors Statement in P-time Dynamic Programming Derivation
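
That the two reduced HEARS clauses still deliver every value named in the original clauses can be checked by following the reduced connections transitively. The following Python sketch does this for small n; the function names are hypothetical and the check assumes, as in the basic observation, that a processor forwards the values it has heard.

```python
def original_hears(l, m):
    """Processors P[l,m] HEARS before reduction: P[l,k] and P[l+k,m-k], 1 <= k < m."""
    return {(l, k) for k in range(1, m)} | {(l + k, m - k) for k in range(1, m)}


def reduced_hears(l, m):
    """Processors P[l,m] HEARS after REDUCE-HEARS: P[l,m-1] and P[l+1,m-1]."""
    return {(l, m - 1), (l + 1, m - 1)} if m >= 2 else set()


def reachable(l, m):
    """All processors whose values can reach P[l,m] over reduced connections."""
    seen, stack = set(), [(l, m)]
    while stack:
        p = stack.pop()
        for q in reduced_hears(*p):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def reduction_is_sound(n):
    return all(
        original_hears(l, m) <= reachable(l, m)
        for m in range(1, n + 1)
        for l in range(1, n - m + 2)
    )


if __name__ == "__main__":
    print(reduction_is_sound(6))
```

The check confirms only that the data can arrive; it says nothing about arrival times, which is the subject of Conjecture 1.11.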


1.3.2.2 Rule A5: Write the Individual Processors' Programs

The general idea of the rule is that the first rule isolated the deepest enumeration in the specification
which assigned a value to an array element, and built the beginnings of a parallel structure where
each array element within the domain of that enumeration had its own private processor. Since the
enumeration in time has been replaced by an enumeration in space, the layers of enumeration that
get us to the point which induces the creation of the first parallel structure can be stripped away.

A  technical  note  is  that  the  enumerations  can  only  be  completely  discarded  when  there  is  no 
calculation  at  intermediate  levels.  If  there  is  such  calculation,  the  system  will  have  to  add  it  to  the 
appropriate  processors  when  it  strips  away  the  layers  of  enumeration  that  include  such  calculation 
as well as the deeper enumeration. This does not make the asymptotic behavior of the parallel
structure  any  slower  except  when  the  calculations  include  enumerations.  When  this  is  the  case,  it 
might  be  possible  to  respecify  the  problem  to  have  separate  copies  of  the  array  enumerated  in  the 
calculation  for  each  cell  of  the  target  array.  This  would  require  an  array  whose  dimension  is  the 
sum  of  the  dimensionalities  of  the  two  arrays. 

This rule is expressed in English as follows: "Supply each processor specified by a PROCESSORS
statement with a copy of those enumerations from the original program that occurred within the
region that included the assignment to array elements that generated that PROCESSORS statement.
The references to array elements are replaced by associative lookups from the table of information
that the processor has HEARd. The outer enumerations are stripped from the program, and uses of
the variables that were bound in these outer enumerations are replaced by constants reflecting the
processor's ID."

The derivation of the P-time dynamic programming parallel structure is almost complete. It remains
only to reduce the depth of enumeration to the single level implicit in the segment,

A_{l,m} ← ⊕_{k ∈ {1...m−1}} F(A_{l,k}, A_{l+k,m−k})

Rule A5 does this. The complete parallel structure that results is as follows:

ARRAY A_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1
PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1
    HAS A_{l,m}
    if m=1 then USES v_l, HEARS Q
    if 2 ≤ m ≤ n then USES A_{l,k}, 1 ≤ k < m
                      USES A_{l+k,m−k}, 1 ≤ k < m
                      HEARS P_{l,m−1}
                      HEARS P_{l+1,m−1}
ARRAY v_l, 1 ≤ l ≤ n INPUT
PROCESSORS Q HAS v_l, 1 ≤ l ≤ n
OUTPUT ARRAY O
PROCESSORS R HAS O
(include if m=1): A_{l,1} ← v_l                                                  Θ(1)
(include if m>1): A_{l,m} ← ⊕_{k ∈ {1...m−1}} F(A_{l,k}, A_{l+k,m−k})            Θ(n)
(include if l=1 ∧ m=n): O ← A_{1,n}                                              Θ(1)
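
The per-processor programs can be exercised with a small sequential simulation that evaluates the processors level by level in m, since P_{l,m} depends only on processors with smaller m. This is a Python sketch under stated assumptions: the combining operation and the function F are passed in as ordinary callables (named combine and F here for illustration), and the simulation models data dependencies only, not the timing of the parallel structure.

```python
from functools import reduce
from operator import add, mul


def run_structure(v, combine, F):
    """Evaluate A[l,m] for every processor P[l,m] and return the output O = A[1,n]."""
    n = len(v)
    A = {}
    for l in range(1, n + 1):          # (include if m=1): A[l,1] <- v[l]
        A[l, 1] = v[l - 1]
    for m in range(2, n + 1):          # all processors of one level are independent
        for l in range(1, n - m + 2):
            terms = [F(A[l, k], A[l + k, m - k]) for k in range(1, m)]
            A[l, m] = reduce(combine, terms)
    return A[1, n]                     # (include if l=1 and m=n): O <- A[1,n]


if __name__ == "__main__":
    # Counting the ways to parenthesize a product of five factors.
    print(run_structure([1] * 5, add, mul))
```

With addition as the combining operation and multiplication as F, the structure counts parenthesizations; other choices (for example min and a cost function) give the familiar optimization problems of this class.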


1.3.2.3  Rule  A6:  Improve  Topology  of  Input/Output 

We discovered that the rules described so far will produce a parallel structure in which every
processor is directly connected to the input and output processors when given a specification of
array multiplication. Only one I/O processor is created per I/O array, and for many problems,
including array multiplication, it is necessary to get some input or output from/to every processor.
(P-time dynamic programming is an exception, in which only Θ(n) of the Θ(n²) processors receive
input values and the output is only a single value.)

We  therefore  conceived  another  rule  to  attempt  to  reduce  the  excessive  connectivity  that  results 
from  every  processor  needing  access  to  input  or  output. 

This rule is also not yet formulated in V, but it states that if the following conditions are met:

► the number of processors n_1 in a family that receives input from or sends output to a given
processor is asymptotically unacceptable, and

► there is a HEARS clause H_0 such that the number of processors that do not HEAR any processor
using the H_0 clause (if input) or that are not HEARd by any processor using that clause (if output)
is asymptotically less than n_1,

then the I/O HEARS clauses can be reduced so that only those processors at a source (or terminus
if output) of H_0 are directly connected to the I/O processor.


1.3.2.4 Rule A7: Create Interconnections in a Family to Reduce I/O Connectivity

Rule  A6  allows  the  reduction  of  connections  from/to  an  I/O  processor  where  a  set  of  interconnections 
already  exists  to  solve  the  I/O-free  portion  of  the  problem.  In  some  problems,  including  array 
multiplication,  no  convenient  set  of  interconnections  exists  and  one  must  be  introduced  solely  to 
distribute  I/O  values.  Fortunately,  the  rule  that  would  do  this  is  fairly  simple  to  state  and  is 
evidently implementable, given the mechanisms already required for REDUCE-HEARS.

The rule is: where a single USES clause telescopes, order the induced partition (Definition 1.8) by
the  processor  indices  and  interconnect  the  processors  in  each  partition  with  a  new  HEARS  clause 
where  each  processor  is  connected  (only)  to  its  immediate  predecessor  (if  any)  in  this  ordering. 
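
As an illustration of rule A7, the following Python sketch groups processors whose USES sets overlap, orders each group by processor index, and links every processor to its immediate predecessor. The function name chain_by_uses and the tuple encoding of array elements are assumptions for this example; the sketch also assumes that overlap is an equivalence on the family, so that comparing against one representative per group suffices.

```python
def chain_by_uses(uses):
    """uses maps processor index -> frozenset of array elements it USES.

    Returns a map processor -> predecessor it should HEAR (or None).
    """
    classes = []                       # each class is a list in index order
    for p in sorted(uses):
        for cls in classes:
            if uses[cls[0]] & uses[p]:  # overlapping USES sets: same class
                cls.append(p)
                break
        else:
            classes.append([p])
    pred = {}
    for cls in classes:
        for i, p in enumerate(cls):
            pred[p] = cls[i - 1] if i else None
    return pred


if __name__ == "__main__":
    n = 3
    procs = [(l, m) for l in range(1, n + 1) for m in range(1, n + 1)]
    # USES A[l,k], 1 <= k <= n: processors in one row share a row of A.
    uses_a = {(l, m): frozenset(("A", l, k) for k in range(1, n + 1)) for l, m in procs}
    print(chain_by_uses(uses_a)[(2, 3)])
```

Applied to the USES clause on A in array multiplication, this links PC_{l,m} to PC_{l,m−1}; applied to the clause on B, it links PC_{l,m} to PC_{l−1,m}.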


§1.4 A Derivation of Fast, Parallel Array Multiplication

Computer scientists have proposed many parallel schemes for the array multiplication problem,
probably because it is a practically important problem and seems so obviously amenable to parallel
processing. One of the prettiest parallel structures is described in [KungLei-76]. Kung's algorithm
multiplies n × n arrays in Θ(n) time using Θ(n²) processors of constant size. (Kung makes the
assumption that a solution that involves Θ(n) processors in communication with the outside world
is acceptable. This subsection follows that assumption.) The best known sequential algorithm uses
O(n^{2.81}) multiplications, but the obvious parallel structure using n^{1.81} processors to do the job in
linear time does not work; processors have to wait for other processors and have to receive copies of
their results.

With the rules and postulated mechanisms for deriving information not locally obtainable from the
specifications it does not seem possible to derive Kung's systolic array. It is, however, possible to
derive another parallel structure with linear execution time. We added rule A7 with this derivation
in mind, but do not feel that A7 is contrived or impractical.

Our parallel structure is inferior to systolic arrays because it uses more processors on a restricted
class of matrices called "band matrices," in which all but a narrow diagonal band of the input
matrices (and therefore of the output matrices) contains zero values.

The starting point of this derivation is a specification of array multiplication (we are assuming square
arrays to simplify the discussion):


INPUT ARRAY A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
INPUT ARRAY B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
OUTPUT ARRAY D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n


ENUMERATE i ∈ {1...n}                                   Θ(1)
    ENUMERATE j ∈ {1...n}                               Θ(n)
        C_{i,j} ← Σ_{k ∈ {1...n}} A_{i,k} B_{k,j}       Θ(n³)
        D_{i,j} ← C_{i,j}                               Θ(n²)


The use of arrays C and D seems redundant, but its purpose is technical: our rules would not
permit us to assign multiple processors to a single array if that array were an INPUT or OUTPUT
array. Duplicating all of the arrays in this manner, to avoid all appearances of "prejudicing the
case" of which array's parallelism would be important, would change the resulting parallel
structure only in that each processor would be replaced by a family of three.

MAKE-PSs and MAKE-IOPSs add PROCESSORS statements,


INPUT ARRAY A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PROCESSORS PA HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
INPUT ARRAY B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PROCESSORS PB HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PROCESSORS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
OUTPUT ARRAY D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
| PROCESSORS PD HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n


ENUMERATE i ∈ {1...n}                                   Θ(1)
    ENUMERATE j ∈ {1...n}                               Θ(n)
        C_{i,j} ← Σ_{k ∈ {1...n}} A_{i,k} B_{k,j}       Θ(n³)
        D_{i,j} ← C_{i,j}                               Θ(n²)

MAKE-USES-HEARS  completes  the  rough  form  of  these  statements. 




ARRAY A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PA HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PB HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    USES A_{l,k}, 1 ≤ k ≤ n
    USES B_{k,m}, 1 ≤ k ≤ n
    HEARS PA
    HEARS PB
OUTPUT ARRAY D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PD HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n


ENUMERATE i ∈ {1...n}                                   Θ(1)
    ENUMERATE j ∈ {1...n}                               Θ(n)
        C_{i,j} ← Σ_{k ∈ {1...n}} A_{i,k} B_{k,j}       Θ(n³)
        D_{i,j} ← C_{i,j}                               Θ(n²)


REDUCE-HEARS is unable to improve this parallel structure, because there are no interconnections
among the PCs to improve. Rule A6 is also helpless, although the topology of the interconnection
graph is too rich (Θ(n²) rather than the goal of Θ(n)). Rule A7 comes to the rescue. Adding
the HEARS clauses allowed by A7 and by the USES clauses of PC produces:



ARRAY A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PA HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PB HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    USES A_{l,k}, 1 ≤ k ≤ n
    USES B_{k,m}, 1 ≤ k ≤ n
    HEARS PA
    HEARS PB
|   if m > 1 then HEARS PC_{l,m−1}
|   if l > 1 then HEARS PC_{l−1,m}
OUTPUT ARRAY D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PD HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n


ENUMERATE i ∈ {1...n}                                   Θ(1)
    ENUMERATE j ∈ {1...n}                               Θ(n)
        C_{i,j} ← Σ_{k ∈ {1...n}} A_{i,k} B_{k,j}       Θ(n³)
        D_{i,j} ← C_{i,j}                               Θ(n²)

Then  rule  A6  is  applied  twice,  and  rule  A5  once,  finishing  the  derivation. 

ARRAY A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PA HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PB HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    USES A_{l,k}, 1 ≤ k ≤ n
    USES B_{k,m}, 1 ≤ k ≤ n
|   if m=1 then HEARS PA
|   if l=1 then HEARS PB
    if m > 1 then HEARS PC_{l,m−1}
    if l > 1 then HEARS PC_{l−1,m}
OUTPUT ARRAY D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PD HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n

| C_{l,m} ← Σ_{k ∈ {1...n}} A_{l,k} B_{k,m}             Θ(n)
| D_{l,m} ← C_{l,m}                                     Θ(1)
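
The data movement in this final structure can be simulated tick by tick. The Python sketch below is an illustration under stated assumptions, not the derived V program: each connection carries one value per tick, PA feeds row l of A into PC_{l,1}, PB feeds column m of B into PC_{1,m}, every PC forwards what it has heard to its right and lower neighbours, and each PC buffers a whole row and column (Θ(n) storage) before forming its sum. Indices are zero-based in the code.

```python
def mesh_multiply(A, B):
    """Simulate the mesh of PC processors; return (C, number_of_ticks)."""
    n = len(A)
    # rows[l][m]: entries of row l of A heard so far by PC(l, m), in order.
    rows = [[[] for _ in range(n)] for _ in range(n)]
    # cols[l][m]: entries of column m of B heard so far by PC(l, m), in order.
    cols = [[[] for _ in range(n)] for _ in range(n)]

    def done():
        return all(
            len(rows[l][m]) == n and len(cols[l][m]) == n
            for l in range(n) for m in range(n)
        )

    ticks = 0
    while not done():
        ticks += 1
        deliveries = []  # computed from the state before this tick
        for l in range(n):
            for m in range(n):
                src = A[l] if m == 0 else rows[l][m - 1]
                have = len(rows[l][m])
                if have < len(src):
                    deliveries.append((rows[l][m], src[have]))
                src = [B[k][m] for k in range(n)] if l == 0 else cols[l - 1][m]
                have = len(cols[l][m])
                if have < len(src):
                    deliveries.append((cols[l][m], src[have]))
        for buf, value in deliveries:
            buf.append(value)

    C = [
        [sum(a * b for a, b in zip(rows[l][m], cols[l][m])) for m in range(n)]
        for l in range(n)
    ]
    return C, ticks


if __name__ == "__main__":
    print(mesh_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Under these assumptions the last processor has all of its operands after 2n−1 ticks, and each processor then needs Θ(n) further steps for its own sum, which is consistent with the linear running time claimed for the structure.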


§1.5 Creation of Virtual Arrays, Processor Aggregation


1.5.1  An  Informal  Description  of  the  Techniques 
Consider the enumeration in Subsection 1.3.2.2,

A_{l,m} ← ⊕_{k ∈ {1...m−1}} F(A_{l,k}, A_{l+k,m−k})

There is an enumeration, but only over r-values. For this reason, separate processors will not
be generated for the steps of the enumeration.

Now  one  can  make  a  few  changes  to  the  specification  in  order  to  generate  separate  processors  for 
the  steps  of  the  enumeration.  (This  will  be  motivated  later.) 

Generate the following virtualization, creating the array A′:


ARRAY A′_{l,m,k}, 1 ≤ m ≤ n, 1 ≤ l ≤ n−m+1, 0 ≤ k ≤ #{1...m−1}

A′_{l,m,0} ← base_⊕

ENUMERATE k ∈ ((1...m−1)) do


This  structure  represents  several  changes: 

►  First,  it  introduces  a  new  dimension  to  the  main  array  for  each  level  of  enumeration  performed 
to  find  a  value  for  the  old  elements  of  the  array. 

► Second, the enumeration k ∈ {1...m−1} is changed into the enumeration k ∈ ((1...m−1)). This is
perfectly  legitimate — the  set  enumeration  does  not  forbid  enumeration  in  a  specified  order.  When 
we  consider  automating  this  process,  however,  we  should  remember  that  there  are  m!  ordered 
enumerations  corresponding  to  a  specific  unordered  one  of  length  m.  The  best  orderings  to  try  will 
probably include the arrival orderings inferrable from HEARS and HAS clauses, and the "natural"
orderings,  i.e.  numerical  order  and  inverse  numerical  order  (where  numbers  are  involved). 

Of  course,  this  only  applies  when  the  inner  enumeration(s)  enumerate  over  a  set.  When  the 
enumerand  is  already  a  sequence,  this  step  and  the  fifth  are  unnecessary. 

► Third, the value base_⊕, the value of ⊕(∅), is introduced.

► Fourth, the function ((1...m−1))⁻¹, the inverse of ((1...m−1)) considered as a function, is also
introduced. This will in fact be a function whenever it can be shown that the sequence has no
duplicate elements, which will certainly be the case where the sequence is simply an ordering of
some set, and will often be the case otherwise.

► Fifth, the running totals implicit in the ⊕(set) notation are explicated.

For P-time dynamic programming virtualization is worse than useless. The extra processors serve no
purpose, they need to communicate with each other, and their existence forces the data to arrive in
a specific order. More sophisticated virtualization heuristics could produce a different virtualization
and eventually a different parallel structure by choosing a different base case and enumeration order.
This technique is not useful on this specification.

However, consider the case of linear array multiplication. Application of the seven rules produces
the following parallel structure:

ARRAY A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PA HAS A_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n INPUT
PROCESSORS PB HAS B_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
ARRAY C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n HAS C_{l,m}
    USES A_{l,k}, 1 ≤ k ≤ n
    USES B_{k,m}, 1 ≤ k ≤ n
|   if m=1 then HEARS PA
|   if l=1 then HEARS PB
    if m > 1 then HEARS PC_{l,m−1}
    if l > 1 then HEARS PC_{l−1,m}
OUTPUT ARRAY D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
PROCESSORS PD HAS D_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    USES C_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n
    HEARS PC_{l,m}, 1 ≤ l ≤ n, 1 ≤ m ≤ n


C_{l,m} ← Σ_{k ∈ {1...n}} A_{l,k} B_{k,m}               Θ(n)
D_{l,m} ← C_{l,m}                                       Θ(1)

The asymptotic behavior of this parallel structure seems to be the same as that for Kung's parallel
structure [KungLei-76]. However, there can be an advantage of Kung's parallel structure over
the simpler one. When multiplying "band matrices", where j−i < k_{0,0} ∨ j−i > k_{1,0} ⇒ A_{i,j}=0 and
j−i < k_{0,1} ∨ j−i > k_{1,1} ⇒ B_{i,j}=0, it is possible to use fewer processing elements. If k_{1,0}−k_{0,0}+1=w_0
and k_{1,1}−k_{0,1}+1=w_1, then it can be shown that only (w_0+w_1)n of the n² processors
of our parallel structure can have non-zero answers, and only that many processors have to be
provided. With Kung's parallel structure, however, only w_0 w_1 processors have to be provided. The
multiplication takes Θ(n) time. (It is possible to use the Θ((w_0+w_1)n) processors to multiply the
band matrices in Θ(w_0+w_1) time, but this parallel structure cannot be synthesized automatically
using these techniques, and in any event the time/processors tradeoff offered by Kung's parallel
structure may be desirable.)
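
The processor counts in the band-matrix comparison can be checked numerically for small cases. The Python sketch below counts the product entries that can be non-zero and compares the count with the two figures quoted in the text; the function name band_counts and its return format are assumptions made for illustration, and indices are zero-based.

```python
def band_counts(n, k0a, k1a, k0b, k1b):
    """Count processors needed for band matrices with diagonal offsets
    k0a..k1a (first factor) and k0b..k1b (second factor)."""
    w0 = k1a - k0a + 1
    w1 = k1b - k0b + 1
    # The (i, j) product entry can be non-zero only if some k satisfies
    # k0a <= k - i <= k1a (first factor) and k0b <= j - k <= k1b (second factor).
    live = sum(
        1
        for i in range(n)
        for j in range(n)
        if any(k0a <= k - i <= k1a and k0b <= j - k <= k1b for k in range(n))
    )
    return {
        "w0": w0,
        "w1": w1,
        "simple_structure": live,      # processors with possibly non-zero answers
        "bound": (w0 + w1) * n,        # the (w0 + w1) n figure from the text
        "kung": w0 * w1,               # processors in Kung's structure
    }


if __name__ == "__main__":
    print(band_counts(10, -1, 1, -1, 1))
```

For two tridiagonal 10 × 10 matrices the simpler structure needs 44 live processors, within the (w_0+w_1)n = 60 bound, against w_0 w_1 = 9 for Kung's structure.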

The virtualization process, alone, is not enough to synthesize Kung's systolic arrays. Notice that the
number of processors in the parallel structure that results from the obvious virtualization is Θ(n³).
Partial sums of product array elements reside in different processors at different times. This feature
makes some technique like virtualization necessary to separate the computation of partial products,
but processors have to be grouped to prevent this processor count blowup. Another more difficult
technique, aggregation, will reduce the processor count to the target level.

Heuristically,  aggregation  is  the  grouping  together  of  processors,  each  of  which  does  a  small  amount 
of  work,  into  groups  of  processors,  each  represented  by  a  single  processor.  Each  processor  does  all  of 
the  work  that  any  processor  in  the  original  group  did,  but  this  can  still  be  done  quickly  because  each 
of  the  processors  in  the  original  group  had  a  small  amount  of  work  to  do,  and  no  two  processors  had 
to  do  their  work  at  overlapping  times. 

The reason why Kung's parallel structure can multiply arrays in linear time using constant space
per processor is that he has performed a virtualization on the summation of result array elements.
He avoids the need for n³ processors by a process called processor aggregation. Each processor is
responsible for computing Θ(n) elements of the virtual array.

Reasoning similar to that performed in the change-of-basis generator and theorem prover will serve
us well here. The target interconnection structure is

PROCESSORS P_{l,m}, −n ≤ m ≤ n, −n ≤ l ≤ n−m+1
    HAS C′_{i,j,k}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, 1 ≤ k ≤ n, i−j=l−m, k=min(n−1+l, n+m+l, n)
    USES A_{i,j}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i−j=l
    USES B_{i,j}, 1 ≤ i ≤ n, 1 ≤ j ≤ n, i−j=m
    HEARS P_{l−1,m}
    HEARS P_{l,m+1}
    HEARS P_{l+1,m−1}


which is Kung's structure. This requires two changes of basis of input arrays (i−j of both A and
B, rather than either i or j), and a change of basis for the C array, as well as some rather subtle
timing arguments and replacement of the summation of each C-array element over a set of integers
by a summation over a sequence of integers.

The figures on the following pages illustrate the virtualization and aggregation processes, as they
apply  to  an  n=3  instance  of  a  matrix  multiplication  problem. 


1.5.2  Formal  Definitions  of  Aggregation  and  Virtualisation 

Definition 1.12. A virtualization of a parallel structure is a new parallel structure that results from

► adding a dimension to an array, say A, producing A′ as follows: if A_i is a defined element of A,
and the computation of A_i is performed by enumerating n elements of some set or vector S and
performing a binary operation on a running total and each element of S as it is enumerated, then
A′_{i,m} for 0 ≤ m ≤ n will be a defined element of the new array, A′;

► making the enumeration of S an ordered one; and

► replacing the original enumeration/calculation with a calculation that explicitly folds the j-th value
of the ordered enumeration as performed for A_i by operating on A′_{i,j−1} and that j-th element.

The process of creating a virtualization is also called virtualization.
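
For a single array element, the effect of virtualization is to expose the running totals of an ordered fold as an extra array dimension. A minimal Python sketch follows, assuming the binary operation and its base value are supplied by the caller; the function name virtualize is hypothetical, and itertools.accumulate with the initial argument requires Python 3.8 or later.

```python
from itertools import accumulate
from operator import add


def virtualize(ordered_elements, op, base):
    """Return the new dimension [A'[i,0], ..., A'[i,n]] for one element A[i].

    A'[i,0] is the base value and A'[i,j] = op(A'[i,j-1], j-th element);
    the last entry equals the original, unvirtualized value of A[i].
    """
    return list(accumulate(ordered_elements, op, initial=base))


if __name__ == "__main__":
    print(virtualize([3, 1, 4], add, 0))
```

Each entry of the returned list would be owned by its own processor in the virtualized structure, which is where the additional factor of n in the processor count comes from.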

Definition 1.13. An aggregation of a parallel structure is a new parallel structure that results from
partitioning the old set of processors of a family into equivalence classes, and creating a processor for
each equivalence class. A processor in the aggregation HEARS another such processor if any processor
in the first equivalence class HEARS any processor in the second.

The process of creating an aggregation is also called aggregation.
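As a rough illustration of the definition, the following Python sketch aggregates a hypothetical three-index processor family along the direction (1,1,1). The wiring of the family (each processor HEARS its neighbor one step back along each axis) is an assumption made for the example, not the report's derived structure, and dropping HEARS pairs that fall inside a single class is likewise an assumption.

```python
def aggregate(processors, hears, class_of):
    """Build the aggregated structure from an equivalence-class map.

    processors: iterable of processor indices
    hears:      set of (listener, source) index pairs
    class_of:   function mapping an index to its equivalence-class label
    """
    classes = {class_of(p) for p in processors}
    # A class HEARS another class if any member HEARS any member of it.
    # Pairs inside one class are dropped (an assumption of this sketch).
    class_hears = {
        (class_of(a), class_of(b))
        for a, b in hears
        if class_of(a) != class_of(b)
    }
    return classes, class_hears

n = 3
procs = [(i, j, k) for i in range(1, n + 1)
         for j in range(1, n + 1) for k in range(1, n + 1)]
proc_set = set(procs)
# Hypothetical wiring: listen one step back along each of the three axes.
steps = [(0, 0, -1), (0, -1, 0), (-1, 0, 0)]
hears = {
    (p, (p[0] + d[0], p[1] + d[1], p[2] + d[2]))
    for p in procs for d in steps
    if (p[0] + d[0], p[1] + d[1], p[2] + d[2]) in proc_set
}
# Identify processors that differ by a multiple of (1, 1, 1).
classes, class_hears = aggregate(procs, hears, lambda p: (p[0] - p[2], p[1] - p[2]))
offsets = {(b[0] - a[0], b[1] - a[1]) for a, b in class_hears}
```

Under these assumptions every aggregated processor listens along only three fixed offsets, which is the kind of constant-degree result the definition is meant to allow.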

There are, of course, an intractable number of possible aggregations according to this definition.
Only simple aggregations are worthy of consideration, because allowing complex ones would lead
to a combinatorial explosion and because the complex ones would tend either to leave too many
interprocessor connections or to have too much work being done in some of the processors.

Suppose  that  the  virtualised  family  of  processors  is  defined  as 

(written  P*) 

We feel that interesting aggregations would identify

∀ j, l: ∃P_j, P_{j+l·i} ⇒ P_j ≡ P_{j+l·i}

where i = (i_1, i_2, ..., i_m), all the i_k ∈ {-1, 0, 1}, and l ranges over integers. Here ∃P... means that a
given processor exists (and is not out of bounds of the original virtualization). Early aggregation
systems will confine themselves to this case.

1.5.3 What Virtualization Can and Cannot Accomplish

An important measure of the cost of a parallel structure is the product of the number of processors,
the size of each one, and the amount of time the parallel structure takes to do a calculation. I will
call this the PST measure.

PST = Θ((w_0 + w_1) n^2) for the simpler parallel structure for matrix multiplication, when applied to
band matrices of widths w_0 and w_1. Virtualization and aggregation can improve this to Θ(w_0 w_1 n)
by reducing the number of processors while allowing the size of the processors and the running time
of the algorithm to remain the same.

It is possible to achieve PST = Θ((w_0 + w_1)^2 n) by other means. This is equivalent whenever
w_1 = Θ(w_0). Divide the n × n array of potential processors into (w_0 + w_1) × (w_0 + w_1) blocks
and introduce input and output connections at the appropriate edges of each such block. This is
impossible to derive by techniques shown so far, or reasonable extensions to them. It has the further
disadvantage that the number of connections to input and output processors is O(n), while the
same number is O(w_0 w_1) for the systolic array parallel structure that results from virtualization and
aggregation. A complexity measure that took into account the connections to the I/O processors
would favor the systolic array structure even over the improved simple matrix multiplication scheme.

It should be noted that the parallel structure resulting from partitioning the potential processors
has the same PST as systolic arrays, but P and T are different. Different measures, such as PST^2,
may make different parallel structures more desirable.


§1.6 Practicality Considerations

In addition to the results described above, we have investigated the problems that will be encountered
when automatically derived parallel structures are used. A parallel structure will in general specify
a collection of interconnections that may not correspond to any 'off the shelf' product. We have
begun to develop several concepts which Kestrel intends to explore further in 1983, but we will
describe them briefly here. These considerations will be important when actual use of a system for
automatically generating parallel structures is contemplated.


1.6.1  Basis  Change 

The  topology  of  a  parallel  structure  may  be  the  same  as  that  of  an  existing  multiprocessor  machine, 
but  this  fact  may  not  be  evident  because  of  the  nature  of  the  indices.  Suppose,  for  example,  that 
multiprocessor  systems  of  various  sizes  organized  as  square  grids  were  commonly  available,  but  that  a 
user  had  submitted  an  instance  of  P-time  dynamic  programming  to  the  parallel  structure  generator 
and  received  the  result  described  above.  The  parallel  structure's  topology  fits  half  of  a  square  grid, 
but  this  fact  is  ‘hidden’  under  our  choice  of  indexing.  A  change  of  basis  can  expose  this  fit. 


1.6.2 Granularity Considerations

Many of the rules in this derivation system (and most of the need for inference) result from our
unwillingness to consider as realizable a parallel structure where every processor is connected to
every other. A consideration we labelled granularity persuades us that even a parallel structure in
which every processor is connected to only a constant number of other processors and where the
interconnection diagram is planar may be unrealizable in the future, where it will be common to
have more than one processor, but not a complete system, on a "chip".

The  d-dimensional  lattice  architecture  may  not  be  the  ideal  architecture  for  hardware  implementation 
for  a  couple  of  reasons  to  be  discussed  in  this  section.  One  reason  is  that  the  connections  specified 
may  be  too  rich  for  an  efficient  VLSI  implementation. 



When a multiprocessor system is built on a single chip, or when each processor of a multiprocessor
system is on its own chip, the concepts we intend to introduce are of no importance. However, it
is important to consider the case where each chip contains several processors, but not a complete
system.

The maximum practical 'pin count' of a chip may limit efforts to place ever increasing numbers of
processors on a chip as our fabrication technology improves. This is a separate limitation from wire-count
limitations and planarity limitations. For example, in a two dimensional array of processors
(each processor has two coordinates, each within a range of integers, and P_{i,j} is connected to P_{i,j±1}
and P_{i±1,j}) the interconnection is obviously planar and the number of wires is proportional to the
number of processors. This topology is therefore realizable on a single chip or in a configuration
with one chip per processor. However, if our technology would otherwise allow N^2 processors on
a chip, and a system with M > N^2 processors is desired, the number of busses from one chip to
others would be (except for chips on the edges of the array). This may require more pins than
can be placed on a compact package.
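The pin-count concern for the two dimensional array can be checked by direct counting. The Python sketch below assumes that each chip holds a square block of the processor grid and that every grid link leaving the block needs its own bus; the function name and parameters are hypothetical.

```python
def busses_leaving_chip(block_side, chip_row, chip_col, chips_per_side):
    """Count grid links that cross the boundary of one square chip.

    Processors form a square grid; each chip holds a block_side x block_side
    block, and P[i][j] is linked to P[i][j+1], P[i][j-1], P[i+1][j], P[i-1][j].
    """
    side = block_side * chips_per_side
    lo_r, lo_c = chip_row * block_side, chip_col * block_side
    crossing = 0
    for i in range(lo_r, lo_r + block_side):
        for j in range(lo_c, lo_c + block_side):
            for di, dj in ((0, 1), (0, -1), (1, 0), (-1, 0)):
                ni, nj = i + di, j + dj
                inside_grid = 0 <= ni < side and 0 <= nj < side
                inside_chip = (lo_r <= ni < lo_r + block_side
                               and lo_c <= nj < lo_c + block_side)
                if inside_grid and not inside_chip:
                    crossing += 1
    return crossing

interior = busses_leaving_chip(4, 1, 1, 3)  # chip surrounded by other chips
corner = busses_leaving_chip(4, 0, 0, 3)    # chip on a corner of the array
```

Under these assumptions the count for an interior chip grows with the side of the block, that is, with the square root of the number of processors on the chip, while the chip's processor count grows with its area.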

To  see  how  the  various  proposed  architectures  fare  under  the  criterion  of  minimising  pin  count  as 
processor-per-chip  count  increases,  consider  the  following  table. 

interconnection  geometry  busses  per  N-processor  chip  in  M-processor  system 

complete  interconnection 
perfect  shuffle 
binary  hypercube 
d-dimensional lattice
augmented  tree 
ordinary  tree 

Figure  6.  Interconnection  Requirements  for  Various  Architectures  (tentative) 

It  may  be  possible  to  improve  bounds  marked  with  an  *  by  an  asymptotically  small  factor  using 
suitable  constructions.  Such  improvements  will  not  yield  a  qualitative  difference  in  the  sense  of  the 
argument. 

For any architecture above the horizontal line, any decrease in λ (the element size of a chip's logic
elements or integrated wires) is useless without a proportional decrease in the chip's pin spacing.
This is not true for architectures below the line. For those architectures, it is possible to preserve the
pin spacing as λ decreases, provided the chip's area or pin density is increased modestly.

In the tree structured architectures, most of the processors will be in multiprocessor chips, which
we call leaf chips because they contain the leaf processors. These chips each hold 2^j leaf processors
for some j, plus 2^j - 1 other processors necessary to tie the leaves together. Pairs of chips, including
leaf chips, will be tied together with single processor chips having three busses each (or five for
augmented architectures; see Figure 7). The number of single processor chips is one less than the
number of leaf chips.
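The chip inventory described above can be tallied with a short Python sketch. It assumes a complete binary tree whose leaf count is a power of two and leaf chips that each hold a complete subtree; `tree_chip_counts` is a hypothetical helper, not part of the report.

```python
def tree_chip_counts(total_leaves, j):
    """Chip inventory for a complete binary tree cut into leaf chips.

    total_leaves must be a power of two and at least 2**j.
    Returns (leaf chips, single-processor chips, total processors).
    """
    leaves_per_chip = 2 ** j
    internal_per_chip = leaves_per_chip - 1    # processors tying the leaves together
    leaf_chips = total_leaves // leaves_per_chip
    single_processor_chips = leaf_chips - 1    # tree nodes above the leaf chips
    total = (leaf_chips * (leaves_per_chip + internal_per_chip)
             + single_processor_chips)
    return leaf_chips, single_processor_chips, total

leaf_chips, single_chips, total = tree_chip_counts(64, 3)
```

The total agrees with the usual node count of a complete binary tree with 64 leaves, which is a quick consistency check on the split into leaf chips and single-processor chips.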

A  construction  that  eliminates  the  single-processor  chips  in  return  for  increasing  the  buss  connections 
required  for  all  chips  by  a  modest  constant  factor  has  been  described  [BhattLei-82]. 


Recovered second-column entries of Figure 6 (row alignment lost): NM; 2N*; N(log(M/N))*; 2 log(N + 1) + 1
Section 2


Inference  Requirements  Analysis  and  Implementation  Proposal 


Tom  C.  Brown 
Kestrel  Institute 
October  1982 


§2.1  Introduction 

Inference requirements for two of Richard King's concurrent computing system synthesis rules
(MAKE-USES-HEARS and REDUCE-HEARS) are analyzed and shown to be

► intractable in their more general forms

► tractable under realistic constraints which include the applications thus far considered.

The ad-hoc constraints bring to bear special-case decision procedures for extended Presburger
arithmetic and systems of linear constraints [Shostak-77,79,81].

The first rule [§ 1.3.1.3] documents data-flow dependencies for iterative array computations. Each
element of an O(n^r)-element array is defined exactly once by a sequence of iterative array-element
assignments using inputs and previously defined array elements. The solution is in effect
a parameterized description of a disjoint covering of the computation-array index set. Under
reasonable constraints this covering can be computed in linear time and verified (disjointness,
completeness) in quadratic time, as a function of the number of iterated assignment statements
in the input specification.

The second rule [§ 1.3.2.1] recognizes a phenomenon called snowballing, wherein each member of an
O(n)-element ordered array or processor family depends on results of each predecessor. This O(n^2)
dependency pattern is reduced to an O(n) connection pattern wherein each processor receives results
from its immediate predecessor (or input) and forwards them (plus its own result) to its immediate
successor (or output). Heuristic guidance for the solution is extracted from

► the physical adjacency postulate: processors with "nearby" indices are candidates for immediate
connection (the Hears relation)

► the linearity postulate: each HEARS clause defines a linear one-dimensional (O(n)) subfamily of
the processor index set.

These constraints are easily tested. Once verified, the snowballing property reduces to a simple test
which, instead of being O(2^{2^n}) as in the general Presburger-arithmetic decision problem [Shostak-79]
of which it is an instance, is linear (in the input HEARS-clause length, under reasonable assumptions,
§ 2.4-5).
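The reduction from a quadratic dependency pattern to a linear connection pattern can be made concrete for a one-dimensional family. The Python sketch below is illustrative only: it assumes processors are numbered 1 to n, and the helper names are hypothetical.

```python
def snowball_edges(n):
    """Dependency pattern: processor i HEARS every predecessor 1..i-1."""
    return {(i, k) for i in range(1, n + 1) for k in range(1, i)}

def reduced_edges(n):
    """Connection pattern: processor i HEARS only its immediate predecessor."""
    return {(i, i - 1) for i in range(2, n + 1)}

def reachable(edges, start):
    """All processors whose results can reach `start` along HEARS edges."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for listener, source in edges:
            if listener == node and source not in seen:
                seen.add(source)
                frontier.append(source)
    return seen

n = 6
full, chain = snowball_edges(n), reduced_edges(n)
```

Forwarding along the chain delivers to each processor exactly the results it depended on in the original pattern, which is why the reduced structure can compute the same values with far fewer wires.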


§2.2  Data  Flow  Analysis 

The MAKE-USES-HEARS rule operates on a specification wherein a processor has been assigned
to each computation array element and each I/O array. It extracts from the program a set of inferred
conditions and corresponding USES and HEARS clauses. The conditions are inferred from index
ranges of enumerated (iterated) assignment statements. The rule makes allowances for the fact that
iteration index variables need not correspond to processor index variables, or that first even and
then odd rows may be computed, etc.

Consider the schema [King-82],

1 PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1

2 HAS A_{l,m}

3 PROCESSORS Q

4 HAS v_l, 1 ≤ l ≤ n

5 PROCESSORS R

6 HAS O

7 enumerate l' ∈ {1...n} do

8 A_{l',1} ← v_{l'}

9 enumerate m' ∈ {2...n} do

10 enumerate l' ∈ {1...n-m'+1} do

11 A_{l',m'} ← ⊕_{k' ∈ {1...m'-1}} F(A_{l',k'}, A_{l'+k',m'-k'})

12 O ← A_{1,n}

Following line 2 we should use lines 7-8 to infer the condition


(P.3a) If m=1 then

USES v_l, 1 ≤ l ≤ n
HEARS Q

because the assignment (line 8) binds m to 1 and sets A_{l',1} = v_{l'} (l' = 1...n).

Similarly, we should adjoin the clauses

(P.3b) If 2 ≤ m ≤ n then

1 USES A_{l,k}, 1 ≤ k ≤ m-1
HEARS P_{l,k}, 1 ≤ k ≤ m-1

2 USES A_{l+k,m-k}, 1 ≤ k ≤ m-1
HEARS P_{l+k,m-k}, 1 ≤ k ≤ m-1

where again the inferred condition 2 ≤ m ≤ n is derived directly from the controlling enumeration
(line 9). The two subclauses (1) and (2) are not part of the inferred-conditions derivation whose
automation is the subject of this section; however, the rule derives (1) by selecting A_{l',k'} in line 11
and noting that the definition of A_{l',m'} uses A_{l',k'} for k' = 1, ..., m'-1, and similarly for (2) using
A_{l'+k',m'-k'}. These mechanisms are already encoded in King's rule.
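To make the schema and its inferred clauses concrete, here is a minimal Python sketch. It assumes that line 11 combines pairs of earlier elements with some two-argument function and then reduces over k'; the stand-ins `combine` and `pair_fn` (minimum and addition in the example) are hypothetical choices, as are the function names.

```python
def hears_of(l, m):
    """Processor indices HEARd by P[l][m] under clauses (P.3a)/(P.3b)."""
    if m == 1:
        return set()                                  # (P.3a): only Q is HEARd
    first = {(l, k) for k in range(1, m)}             # subclause 1
    second = {(l + k, m - k) for k in range(1, m)}    # subclause 2
    return first | second

def run_schema(v, combine, pair_fn):
    """Sequential reading of the schema; comments give its line numbers."""
    n = len(v)
    A = {}
    for l in range(1, n + 1):                         # lines 7-8
        A[(l, 1)] = v[l - 1]
    for m in range(2, n + 1):                         # line 9
        for l in range(1, n - m + 2):                 # line 10
            terms = [pair_fn(A[(l, k)], A[(l + k, m - k)]) for k in range(1, m)]
            A[(l, m)] = combine(terms)                # line 11
    return A[(1, n)]                                  # line 12

result = run_schema([1, 1, 1], min, lambda a, b: a + b)
heard = hears_of(2, 3)
```

Every index returned by `hears_of` lies inside the processor family declared on line 1, so each HEARS clause names only processors that exist.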

In  general  the  inferred  condition  problem  is,  given  declaration 




ARRAY A_i, R_1 ∧ ... ∧ R_p (1)


with domain {i : R_1 ∧ ... ∧ R_p} and a list of iterated assignments


enumerate j_1 : S_1
...
enumerate j_q : S_q

A_{f(j)} ← G[A_{g_t(j,k_t)} : 1 ≤ t ≤ r] (2)


verify that the corresponding sets


{f(j) : S_1 ∧ ... ∧ S_q} (2')

form a disjoint covering of {i : R_1 ∧ ... ∧ R_p}. Clearly this condition is best tested by expressing
each condition (2') in the form


{i : S'_1 ∧ ... ∧ S'_q} (3)

where S'_k ≡ ∃j.[f(j) = i ∧ S_k(j)]; moreover, (3) is exactly the inferred condition required. Clearly
(3) is uniquely defined from (2') iff f is injective (one to one) on {j : S_1 ∧ ... ∧ S_q}; otherwise A_{f(j)}
is defined twice.

To ensure effectiveness of the reduction to (3) we require that f be a linear transformation from Z^q
to Z^p:


f(j)_k = f_k × j + d_k (4)

where f_k × j is the inner product of f_k and j. Similar linearity constraints are placed on R_k, S_k -
e.g. R_k has the form


L_k ≤ C_k × j_k + D_k ≤ U_k (5)

where L_k, U_k, C_k, D_k may contain j_1, ..., j_{k-1}, n free.

Now the covering of {i:R} (A's domain) is disjoint iff S' ∧ T' is unsatisfiable for each pair (S', T')
such that {i:S'} and {i:T'} are distinct instances of (3). In this conjunction n is a Skolem constant.

The  disjointness  condition  can  be  readily  tested  if  in  (3), 

S{  A  ...  A  Sj  is  a  Presburger  formula  with  constants  (e.g.  i,  n)  (6) 

Then the decision procedure of [Shostak-79] applies. This condition is clearly satisfied by the above
example, and all others in [King-82].

The covering condition can be tested similarly. If {i:T_1}, ..., {i:T_r} are the instances of (3) then
they cover {i:R} iff


∀n,i.(R ⇒ T_1 ∨ ... ∨ T_r)

which reduces to extended-Presburger decidability of

R ∧ ¬T_1 ∧ ... ∧ ¬T_r.

Notice that the HEARS clause for (2) is obtained by first transforming the assignment (2) with
constraints S_1 ∧ ... ∧ S_q on j into an assignment


A_i ← G[A_{g'_t(i,k_t)} : 1 ≤ t ≤ r]


with constraints S'_1 ∧ ... ∧ S'_q on i. This implies that g_t(j, k_t) must also be linear. The k_t are
variables bound by iterated operators in G[...] - e.g., (⊕_{k' ∈ {1...m'-1}} F(...)) in line 11 above.

To conclude, the inferred-conditions function requires moderate ability to reason about systems of
symbolic inequalities in extended Presburger arithmetic, to rename variables and to invert linear
operations appearing in such formulas. Initially the inferred-conditions transformation may be
implemented (for the case considered) by an interactive flow-analysis with linear-operator manipulation
and extended Presburger inference capabilities.


§2.3 Reducing Processor Interconnection Degree


2.3.1  Problem  Statement 
Given a program statement


PROCESSORS PNAME_{PBV} PITER ... HEARS PNAME_{HBV} HITER (1)


where


PBV = processor bound-variable list,

HBV = HBV(PBV, k),

k = bound variable(s) not in PBV iterated by:

HITER = HITER(PBV, n, k), iterator over k


Define F = F(n) = {PBV : PITER(PBV, n)}, the processor-family (index set), and
H = {(a, b) : PITER(a, n) ∧ b = HBV(a, k) ∧ HITER(a, n, k)}, the Hears relation of (1).
Define H_a, the processors HEARd by a:


H_a = {b : H_{ab}}



Recall now the definitions of 'telescopes' and 'snowballs':

► H telescopes if either H_a ⊆ H_b, H_b ⊆ H_a, or H_a ∩ H_b = ∅, i.e., ∀ a, b ∈ F. H_a ∩ H_b ∈ {∅, H_a, H_b}

► H snowballs if it telescopes and ∀ a, b, z ∈ F. [∅ ⊂ H_a ⊂ H_b ∧ H_a ∪ {z} = H_b] ⇒ [z = a]*

If H snowballs then HITER(PBV, n, k) in (1) is 'reduced' by setting k = k_0 where
HITER(PBV, n, k_0) and HBV(PBV, k_0) is the index in H_{PBV} "closest" to PBV (using sum of
absolute coordinate-differences as metric).
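The two predicates can be checked by enumeration on a small processor family. The Python sketch below applies them to the two clauses of the dynamic-programming example and to their union; it is a finite check for one value of n with hypothetical helper names, whereas the report's concern is a symbolic test valid for all n.

```python
def family(n):
    """Processor index set: 1 <= m <= n, 1 <= l <= n - m + 1."""
    return [(l, m) for m in range(1, n + 1) for l in range(1, n - m + 2)]

def clause_a(p):
    l, m = p
    return {(l, k) for k in range(1, m)}

def clause_b(p):
    l, m = p
    return {(l + k, m - k) for k in range(1, m)}

def hears_sets(fam, hears_of):
    return {a: frozenset(hears_of(a)) for a in fam}

def telescopes(H):
    empty = frozenset()
    return all((H[a] & H[b]) in (empty, H[a], H[b]) for a in H for b in H)

def snowballs(H):
    if not telescopes(H):
        return False
    for a in H:
        for b in H:
            if H[a] and H[a] < H[b]:          # non-empty proper subset
                extra = H[b] - H[a]
                # If exactly one element z completes H[a] to H[b], z must be a.
                if len(extra) == 1 and next(iter(extra)) != a:
                    return False
    return True

fam = family(5)
H_a = hears_sets(fam, clause_a)
H_b = hears_sets(fam, clause_b)
H_union = hears_sets(fam, lambda p: clause_a(p) | clause_b(p))
```

Each clause taken alone passes both tests for n=5, while lumping the two clauses into a single relation already fails the telescopes test, which suggests why the clauses are treated one at a time.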


2.3.2  Example: 

An application of MAKE-USES-HEARS in [King-82] generates a statement

PROCESSORS P_{l,m}, 1 ≤ m ≤ n, 1 ≤ l ≤ n-m+1
if 2 ≤ m ≤ n then

HEARS P_{l,k}, 1 ≤ k ≤ m-1 (a)

HEARS P_{l+k,m-k}, 1 ≤ k ≤ m-1 (b) (2)

Clauses (a) and (b) each generate snowballing Hears-relations, and are reduced respectively to


HEARS P_{l,m-1} (a)

HEARS P_{l+1,m-1} (b)


It may be helpful to illustrate the resulting pattern for the case n=5 (b):

Figure 7. HEARS clause (2b)

*See the note at the end of this Section.

Notice that in (a) the clause is reduced by setting k = m-1 whereas in (b) it is reduced by setting
k = 1. Both clauses can be effectively normalized so that the solution will be to set k = maximal value
(below).
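The choice of k in each clause can be reproduced by a direct search for the HEARd index closest to the listening processor in the taxicab metric. This Python sketch uses hypothetical helper names and one sample processor; it is only a numeric illustration of the reduction.

```python
def taxicab(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(p, q))

def reduce_clause(pbv, hbv, k_range):
    """Pick the k whose HEARd index is closest to pbv in taxicab metric."""
    return min(k_range, key=lambda k: taxicab(pbv, hbv(k)))

l, m = 2, 4                      # sample processor P[2][4]
k_values = range(1, m)           # 1 <= k <= m - 1

k_a = reduce_clause((l, m), lambda k: (l, k), k_values)          # clause (a)
k_b = reduce_clause((l, m), lambda k: (l + k, m - k), k_values)  # clause (b)
nearest_a = (l, k_a)
nearest_b = (l + k_b, m - k_b)
```

For this sample the search selects the largest k in clause (a) and the smallest k in clause (b), matching the two reduced clauses.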


2.3.3 Remarks on 'General Theorem-Proving Approach'

Without  constraints  on  (1)  the  snowballing  property  can  be  quite  intractable.  Even  if  PITER,  HBV, 
and  HITER  are  constrained  so  that  only  extended-Presburger  formulas  result,  the  problem  may  be 
intractable  without  additional  constraints  and/or  expertise  on  the  Presburger  problem-domain. 

Thus given (2b), we would extend a Presburger arithmetic basis (or specialized prover) with pairing
axioms


hd(x, y) = x, tl(x, y) = y
Integer(z) ∨ (hd(z), tl(z)) = z


Then


F(u) ⇔ 1 ≤ tl(u) ≤ n ∧ 1 ≤ hd(u) ≤ n - tl(u) + 1
H_{uv} ⇔ tl(v) = tl(u) + hd(u) - hd(v)

∧ 1 ≤ hd(v) - hd(u) ≤ tl(u) - 1 < n

∧ 1 ≤ hd(u) ≤ n - tl(u) + 1


are derived following (1).

To prove Telescopes(H) we assume not, for some a, b, c, d ∈ F(n), and derive a contradiction:


F(a) ∧ F(b) ∧ F(c) ∧ F(d)
H_a ∩ H_b ≠ ∅

H_{ac} ∧ ¬H_{bc} } H_a ⊈ H_b

H_{bd} ∧ ¬H_{ad} } H_b ⊈ H_a


false } via Integer-Arithmetic, Pairing Axioms

To  prove  Snowballs  (H),  we  assert  its  negation  for  some  a,b,c,d  in  F(n): 


F(o)  A  F(b)  A  F{e)  A  F{d) 

H,#1  V  Hy,,  V  Hy,,  V  Ht*,  V  Hztt  V  '  '  Hy,, 
Had 

~  H„  V  Hi, 

~Hu 

~  H»,  V  (H„)  V  [*=e] 
o  ^  c 


false 

where the second clause asserts "snowballs" and x, y, z_1, z_2, z_3 are universally quantified variables.
These axioms are in a form which can be given to the LMA prover [OverLusk-80].

We  expect  that  specialised  knowledge  of  extended-Presburger  arithmetic  decision  procedures  and 
integer  programming  will  be  required  (at  least)  for  success  of  so  direct  an  approach  to  this  class  of 
problems.  Another  approach  is  to  further  constrain  the  problem  without  excluding  the  common 
cases  of  interest. 


2.3.4  Heuristic  Constraints 

Notice that snowballing HEARS clauses define "one-dimensional" transitive relations over F - e.g.,
the two-dimensional HEARS clause


HEARS P_{l',m'}, l ≤ l' ≤ l + (m-m')


which "merges" (a) and (b) of (2) does not satisfy the "snowballs" predicate. Indeed its "reduction"
would result in O(n^2) processors sending data through two asymptotically hot wires. Thus we lose no
generality in constraining HITER (1) to iterate a single parameter (k) over a finite integer subrange
dependent on PBV, n:


HITER = [L(PBV, n) ≤ k ≤ U(PBV, n)] (3)


Another plausible constraint is that each one-dimensional subfamily of a snowballing HEARS be a
"linear" subset of the lattice points over which HBV ranges - e.g., for PBV fixed,

HBV(PBV, k) is linear in k (4)


Equivalently, the first differential in k


HBV(PBV, k + 1) - HBV(PBV, k) (5)

is independent of k. Indeed, we find plausible the stronger constraint,


(5) is constant (independent of both k and PBV) (6)

After all, if (5) varies with PBV then distinct collinear H-subsets of F have different slopes and are
"likely" (though not required) to intersect, in violation of the telescopes constraint.
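Constraint (6) can be probed numerically by collecting the first differences of HBV over a small family. The sketch below is in Python with hypothetical helper names; `hbv_bad` is an invented counterexample whose slope depends on PBV, and enumeration over one n stands in for the symbolic simplification the report has in mind.

```python
def family(n):
    return [(l, m) for m in range(1, n + 1) for l in range(1, n - m + 2)]

def first_differences(hbv, pbv, k_values):
    """Set of vectors HBV(pbv, k + 1) - HBV(pbv, k) for the given k."""
    return {tuple(b - a for a, b in zip(hbv(pbv, k), hbv(pbv, k + 1)))
            for k in k_values}

def slope_is_constant(hbv, fam, k_range_of):
    """Check (6): one slope shared by every k and every PBV in fam."""
    slopes = set()
    for pbv in fam:
        ks = list(k_range_of(pbv))[:-1]   # k + 1 must stay inside the range
        slopes |= first_differences(hbv, pbv, ks)
    return len(slopes) <= 1, slopes

fam = family(5)
k_range_of = lambda p: range(1, p[1])                    # 1 <= k <= m - 1
hbv_b = lambda p, k: (p[0] + k, p[1] - k)                # clause (2b)
hbv_bad = lambda p, k: (p[0] + k * p[1], p[1] - k)       # slope varies with m

ok_b, slopes_b = slope_is_constant(hbv_b, fam, k_range_of)
ok_bad, slopes_bad = slope_is_constant(hbv_bad, fam, k_range_of)
```

Clause (2b) indexed by its own k has slope (1, -1); a normal form that starts from the most distant HEARd point and walks toward the listener traverses the same line in the opposite direction.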

These constraints yield a normal form for each "linear snowball" (Figure 8):


HEARS PNAME_{F(z,n)+k·C}, 0 ≤ k < L(z,n) (7)

where C is a constant vector (the slope) and


z = F(z,n) + L(z,n)·C (8)

where F(z,n) is the most-distant HEARd point and k = L(z,n) - 1 selects the nearest HEARd point
(in taxicab metric: sum of absolute coordinate differences):


[Figure: the points z' = F(z,n) + k·C on the line from F(z,n) toward z]

Figure 8. A Linear Snowball

Note that F(z,n) = F(z',n) for each z' on the line; thus F(z,n) ≠ F(z',n) implies H_z ∩ H_{z'} = ∅.

2.3.5 Example. The HEARS clauses of Example 2.3.2 have normal forms:

(a) HEARS P_{(l,1)+k·(0,1)}, 0 ≤ k < m-1

(b) HEARS P_{(l+m-1,1)+k·(-1,1)}, 0 ≤ k < m-1


2.3.6  Linear  Snowball  Recognition-Reduction  Procedure 

Given HEARS clause (1) with HITER as in (3):

Step 1. Verify (6)

Step 2. Put (1) in normal form (7)

Step 3. Verify (8)

Step 4. Verify (9) (for 0 ≤ k < L(z,n)):

F(F(z,n) + k·C, n) = F(z,n) (9)

Step 5. Reduce (7) to (10):

HEARS PNAME_{F(z,n)+(L(z,n)-1)·C} (10)

Failure of any verification attempt above implies return with failure (i.e., the REDUCE-HEARS
rule does not apply). This procedure suggests a refinement of King's rule to two rules, a
NORMALIZE-HEARS rule which tests (6), and a REDUCE-NORMALIZED-HEARS rule
which implements the remainder of this procedure.
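Steps 3 to 5 of the procedure can be sketched in Python for a clause that is already in normal form. The sketch checks conditions (8) and (9) by enumeration over a small family rather than by symbolic simplification, so it illustrates the logic but not the linear-time bound; the names `vec_add`, `vec_scale`, and `reduce_linear_snowball` are hypothetical, and n is fixed by the family passed in.

```python
def vec_add(p, q):
    return tuple(x + y for x, y in zip(p, q))

def vec_scale(k, c):
    return tuple(k * x for x in c)

def reduce_linear_snowball(fam, F, L, C):
    """Return {z: reduced HEARd index}, or None if a verification fails.

    F(z): most distant HEARd index, L(z): number of HEARd indices,
    C: constant slope vector of the normal form.
    """
    reduced = {}
    for z in fam:
        if L(z) <= 0:
            continue                                    # z HEARS nothing
        if vec_add(F(z), vec_scale(L(z), C)) != z:      # Step 3: condition (8)
            return None
        for k in range(L(z)):                           # Step 4: condition (9)
            if F(vec_add(F(z), vec_scale(k, C))) != F(z):
                return None
        reduced[z] = vec_add(F(z), vec_scale(L(z) - 1, C))  # Step 5: clause (10)
    return reduced

n = 5
fam = [(l, m) for m in range(1, n + 1) for l in range(1, n - m + 2)]
length = lambda z: z[1] - 1                             # L(z, n) = m - 1
red_a = reduce_linear_snowball(fam, lambda z: (z[0], 1), length, (0, 1))
red_b = reduce_linear_snowball(fam, lambda z: (z[0] + z[1] - 1, 1), length, (-1, 1))
bad = reduce_linear_snowball(fam, lambda z: (z[0], 1), length, (0, 2))
```

With the normal forms of the two example clauses the sketch returns the nearest HEARd neighbor for every processor, and with a deliberately wrong slope it reports failure at condition (8).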


2.3.7 Correctness and Complexity of REDUCE-HEARS Refinement

The constraints (3)-(6) can be tested in linear time, provided that HBV(PBV, k) contains no
non-linear symbolic expressions in (PBV, k). (Note that a linearity claim must exclude perverse
specifications such as T(n) × PBV(1)^2 × k^3 where T(n) is some arbitrary arithmetic formula which
eventually simplifies to zero.)

Given similar non-perverse linearity constraints on L(PBV, n), U(PBV, n) of (3), we assert linearity
of the normal-form conversion (7). Condition (8) is a consistency test; it distinguishes the linear
snowball F(z,n) + k·C from the non-snowballing HEARS index F(z, ...). Certainly
it is conceivable that F(z,n) and L(z,n) might contain symbolic constants whose values would decide
truth or falsity of (8); in this event REDUCE-NORMALIZED-HEARS should admit failure and
ask the user what is going on. (Thus far we have no experience with such specifications).

Now (9) is precisely what we need (given (8)) to assert that H telescopes:

H_z ∩ H_{z'} = ∅ ⇐ F(z,n) ≠ F(z',n);

H_z ∩ H_{z'} ∈ {H_z, H_{z'}} ⇐ F(z,n) = F(z',n).

Again its verification (under the non-perversity assumption) requires only a linear-time simplification
of a symbolic linear expression; the constraint that k < L(z,n) has nothing to do with its truth or
falsity.

To conclude, the snowballs antecedent (2) now reduces to

{F(a,n) + k·C : 0 ≤ k < L(a,n)} ∪ {z}

= {F(b,n) + k·C : 0 ≤ k < L(b,n)},

which implies L(b,n) = L(a,n) + 1 by telescoping (F(a,n) = F(b,n)). Therefore

z = F(a,n) + L(a,n)·C = a

by (8), as required. We have proved the following:

Theorem 2.1. If Procedure 2.3.6 returns successfully with reduced HEARS clause (10) then it is a
reduction of the (linear) snowballing HEARS clause (1).


§2.4  Conclusions 

Significantly, Procedure 2.3.6 does recognize the class of snowballs thus far encountered (and which
we expect to encounter) in linear time, instead of the super-exponential (worst-case) time which we
might initially fear for the unconstrained theorem-proving approach of § 2.3.3. Both this and the
inferred conditions problem illustrate the important heuristic of restricting the problem domain so
that simple procedures can be applied.


Note 


The REDUCE-HEARS analysis is based on a somewhat less refined (and earlier) definition of
"snowballs" than the one used in Section 1. Under the heuristic constraint of § 2.3.4 the two concepts
are equivalent. R. King provided a discriminating example:


F = {0, 1, ..., n}

k):0  <  k  <  2 


A  /  <  r>} 


It snowballs according to Section 2 but not according to Section 1. It violates the heuristic constraints
of § 2.3.4 because 2⌊l/2⌋ is not a linear function of l. That it can be made into a snowball according
to Section 2 by adjoining n/2 additional HEARS edges ("rounding and reducing") suggests that
both definitions merit consideration. A sequel to this report will present a simplified analysis in
terms of the more refined definition.